APPARATUS AND METHOD FOR DIVIDING
DOCUMENT INCLUDING TABLE 

BACKGROUND OF THE INVENTION 
Field of the Invention 

The present invention relates to a document segmentation apparatus and method for dividing a document content by content, and more particularly to a document segmentation apparatus and method for dividing a document that includes one or more tables.

Related Background Art 

Conventionally, information on the web has been presented in units of "pages", and the arrangement and size of each page can be set freely by the information presenter. The presenter naturally forms the pages according to what he or she intends to convey, but such pages do not necessarily meet the requirements of a reader.

Accordingly, even when a series of topics or subjects that the presenter judges to be closely related are gathered in one page, the reader may not need that relation, and, if only one of the plural subjects is useful, the information on the other subjects may be an obstacle when the required information is retrieved. Particularly, in mobile equipment having a limited information presenting space, a function for displaying only the required information is important.

Thus, it is important that documents to be displayed are divided content by content (segmentation) in advance so that only the portion requested by the reader can be presented. In almost all web pages, the contents are written in Hyper Text Markup Language (HTML), a language for composing web pages. Although HTML is a language for describing the structure of a document, it is difficult to describe the details of the logical structure with HTML, and the main role of HTML is to designate the layout in the browser.

However, the viewpoint of the information presenter is considered to be reflected in the layout of the page. Thus, a technique has been proposed in which the page is divided on the basis of HTML tags in order to generate segments that reflect the intention of the information presenter.

In such a technique, a table between a <TABLE> tag and a </TABLE> tag is judged to be one meaningful group and is formed as one segment. However, a table frequently includes a plurality of pieces of information and occupies a relatively large space.

Further, tables can be categorized into tables in the general meaning and tables used for designating the layout of images or text. In the two cases, the tags are used in quite different ways.

Furthermore, even when a table describes a simple table, a set of data may be represented in a column or in a row, and there may or may not be a column (or row) of item names; namely, tables have various styles.

SUMMARY OF THE INVENTION 

An object of the present invention is to divide a table into a plurality of segments on the basis of the contents thereof.

Another object of the present invention is to divide a table in a document content by content, by analyzing the table to be processed to judge whether the table is a table describing a table in the general meaning or a table used as a tool for layout, and by generating segments accordingly.

A further object of the present invention is to divide a table in the general meaning into data segments on the basis of the style of the table when the table describes a table in the general meaning.

A still further object of the present invention is to generate segments on the basis of groups of contents when a table is used to lay out images or text.

According to one aspect, the present invention which achieves these objectives relates to a document segmentation apparatus comprising table analyzing means for generating cell position data indicating a positional relationship between cells and cell vectors representing characteristics of the cells, by analyzing a table in a document to be processed, table type judging means for judging a table type with reference to the cell position data and the cell vectors generated by the table analyzing means, first segment generating means for generating a segment from the table when the table type is a table describing a table, and second segment generating means for generating a segment from the table when the table type is a table for layout.

According to another aspect, the present invention which achieves these objectives relates to a document segmentation method comprising a table analyzing step for generating cell position data indicating a positional relationship between cells and cell vectors representing characteristics of the cells, by analyzing a table in a document to be processed, a table type judging step for judging a table type with reference to the cell position data and the cell vectors generated by the table analyzing step, a first segment generating step for generating a segment from the table when the table type is a table describing a table, and a second segment generating step for generating a segment from the table when the table type is a table for layout.

According to still another aspect, the present invention which achieves these objectives relates to a computer-readable storage medium storing a document segmentation program for controlling a computer to perform document segmentation, the program comprising codes for causing the computer to perform a table analyzing step for generating cell position data indicating a positional relationship between cells and cell vectors representing characteristics of the cells, by analyzing a table in a document to be processed, a table type judging step for judging a table type with reference to the cell position data and the cell vectors generated by the table analyzing step, a first segment generating step for generating a segment from the table when the table type is a table describing a table, and a second segment generating step for generating a segment from the table when the table type is a table for layout.

Other objectives and advantages besides those discussed above shall be apparent to those skilled in the art from the description of preferred embodiments of the invention which follows. In the description, reference is made to the accompanying drawings, which form a part thereof, and which illustrate examples of the invention. Such examples, however, are not exhaustive of the various embodiments of the invention, and therefore reference is made to the claims which follow the description for determining the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram showing a functional construction of a document segmentation apparatus according to a first embodiment of the present invention;

Fig. 2 is a block diagram showing a hardware construction of the document segmentation apparatus according to the first embodiment;

Fig. 3 is a flow chart showing a procedure of the document segmentation processing according to the first embodiment;

Fig. 4 is a view for explaining the maximum distance algorithm;

Fig. 5 is a block diagram showing a functional construction according to a second embodiment of the present invention;

Fig. 6 is a block diagram showing a functional construction according to a third embodiment of the present invention;

Fig. 7 is a block diagram showing a functional construction according to a fourth embodiment of the present invention;

Fig. 8 is a view showing an example of a table in an HTML document;

Fig. 9 is a block diagram showing a functional construction according to a fifth embodiment of the present invention;

Fig. 10 is a block diagram showing a construction of a table type judgement part according to the fifth embodiment;

Fig. 11 is a flow chart showing a procedure of table type judgement processing according to the fifth embodiment;

Fig. 12 is a view showing an example of a table in an HTML document;

Fig. 13 is a block diagram showing a construction of a table type judgement part according to a sixth embodiment of the present invention;

Fig. 14 is a flow chart showing a procedure of table type judgement processing according to the sixth embodiment;

Fig. 15 is a view showing an example of a table in an HTML document;

Fig. 16 is a block diagram showing a construction of a table type judgement part according to a seventh embodiment of the present invention;

Fig. 17 is a flow chart showing a procedure of table type judgement processing according to the seventh embodiment;

Fig. 18 is a block diagram showing a construction of a table type judgement part according to an eighth embodiment of the present invention;

Fig. 19 is a flow chart showing a procedure of table type judgement processing according to the eighth embodiment;

Fig. 20 is a block diagram showing a construction of a table type judgement part according to a ninth embodiment of the present invention;

Fig. 21 is a flow chart showing a procedure of table type judgement processing according to the ninth embodiment;

Fig. 22 is a block diagram showing a construction of a table type judgement part according to a tenth embodiment of the present invention;

Fig. 23 is a flow chart showing a procedure of table type judgement processing according to the tenth embodiment;

Fig. 24 is a view showing an example of a table in an HTML document;

Fig. 25 is a block diagram showing a functional construction of a document segmentation apparatus according to an eleventh embodiment of the present invention;

Fig. 26 is a flow chart showing a procedure of the document segmentation processing according to the eleventh embodiment;

Fig. 27 is a flow chart showing a procedure for HTML table reformation according to the eleventh embodiment;

Fig. 28 is a view showing an example of a table in an HTML document;

Figs. 29A, 29B, 29C, 29D and 29E are flow charts showing a procedure for HTML table reformation according to a twelfth embodiment of the present invention;

Figs. 30A, 30B, 30C, 30D, 30E and 30F are views showing examples of multi-row/multi-column tables;

Figs. 31A, 31B, 31C and 31D are flow charts showing a procedure for HTML table reformation according to a thirteenth embodiment of the present invention;

Figs. 32A, 32B and 32C are views showing an example of a composite table;

Fig. 33 is a block diagram showing a construction of an HTML table reformation part according to a fourteenth embodiment of the present invention;

Fig. 34 is a flow chart showing a procedure of the HTML table reformation processing according to the fourteenth embodiment;

Fig. 35 is a block diagram showing a construction of an HTML table reformation part according to a fifteenth embodiment of the present invention;

Fig. 36 is a flow chart showing a procedure of the HTML table reformation processing according to the fifteenth embodiment;

Fig. 37 is a block diagram showing a construction of an HTML table reformation part according to a sixteenth embodiment of the present invention;

Fig. 38 is a flow chart showing a procedure of the HTML table reformation processing according to the sixteenth embodiment;

Fig. 39 is a block diagram showing a construction of an HTML table reformation part according to a seventeenth embodiment of the present invention; and

Fig. 40 is a flow chart showing a procedure of the HTML table reformation processing according to the seventeenth embodiment.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention will now be explained in connection with preferred embodiments thereof with reference to the accompanying drawings.

[First Embodiment] 

Fig. 1 is a block diagram showing a functional construction of a document segmentation apparatus according to a first embodiment of the present invention. In Fig. 1, an HTML table storage part 101 serves to hold or store a table (the portion between <table> and </table>) in the HTML document to be processed.

A table analysis part 102 serves to analyze the table stored in the HTML table storage part 101 and to generate cell position data representing a positional relationship between cells and cell vectors representing characteristics of the cells.

The cell vector is determined by the height and width of the cell, the display position of its contents, the background color, the length and character types of the text in the cell, and the size and shape of the images in the cell. The dimension of the cell vector is (number of images in the cell × 4 + 17), and each component is a real number not less than 0 and not more than 1. When the image that appears i-th in the cell is called image i (the first image being image 1), the k-th component v(k) of the cell vector v is defined as follows:

v(0) : 1.0 when the kind of the tag is <TH> (cell representing an item name), and 0.0 when it is <TD> (cell representing data)
v(1) : rowspan × 0.25 when rowspan (row width) is below 4, and 1.0 when rowspan is 4 or above
v(2) : colspan × 0.25 when colspan (column width) is below 4, and 1.0 when colspan is 4 or above
v(3) : 1.0 when nowrap (no line wrapping) is designated, and 0.0 when it is not designated
v(4) : 0.0 when align (lateral position) is not designated, 0.2 when left (left end), 0.4 when center (central position), 0.6 when right (right end), 0.8 when justify (justified), and 1.0 otherwise
v(5) : 0.0 when valign (vertical position) is not designated, 0.2 when top (upper end), 0.4 when middle (center), 0.6 when bottom (lower end), 0.8 when baseline, and 1.0 otherwise
v(6) : 0.0 when bgcolor (background color) is not designated or is not designated by a hexadecimal code, and bgcolor/0xFFFFFF when it is designated by a hexadecimal code
v(7) : (row number) × 0.1 up to the ninth row, and 1.0 from the tenth row onward
v(8) : (column number) × 0.01 up to the 99th column, and 1.0 from the 100th column onward
v(9) : (number of <BR>) × 0.2 when the number of line breaks (<BR>) is below 5, and 1.0 when it is 5 or above
v(10) : (number of characters) × 0.01 when the number of characters in the text is below 100, and 1.0 when it is 100 or above
v(11) : (number of numerals in text)/(total number of characters in text)
v(12) : (number of alphabetic characters in text)/(total number of characters in text)
v(13) : (number of Kanji in text)/(total number of characters in text)
v(14) : (number of Katakana in text)/(total number of characters in text)
v(15) : (number of Hiragana in text)/(total number of characters in text)
v(16) : 1.0 when the text contains a full stop ("。" or "."), and 0.0 when it does not
v(13+i×4) : (area)/150000 when the area of image i is below 150000, and 1.0 when it is 150000 or above
v(14+i×4) : (height)/300 when the height of image i is below 300, and 1.0 when it is 300 or above
v(15+i×4) : (width)/500 when the width of image i is below 500, and 1.0 when it is 500 or above
v(16+i×4) : the ratio, to the character string representing the URL of the page containing this table, of the partial character string that it has in common with the URL of image i. For example, if an image "../image/hoge.gif" is included in a page "http://hogehoge.aaa.bbbbb.co.jp:8080/hoge1/hoge2/hoge.html" (the length of the URL is 58), rewriting the image reference to a full-path URL gives "http://hogehoge.aaa.bbbbb.co.jp:8080/hoge1/image/hoge.gif", so the common character string is "http://hogehoge.aaa.bbbbb.co.jp:8080/hoge1/". Since its length is 43, the value of this component is 43 ÷ 58 = 0.741.
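A few of the components above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and the "common partial character string" of v(16+i×4) is assumed here to mean the common leading part of the two URLs, which is what the worked example in the text suggests.

```python
import os.path
from urllib.parse import urljoin


def span_component(span: int) -> float:
    """v(1) / v(2): rowspan or colspan scaled by 0.25 and capped at 1.0."""
    return min(span * 0.25, 1.0)


def text_length_component(text: str) -> float:
    """v(10): number of characters scaled by 0.01 and capped at 1.0."""
    return min(len(text) * 0.01, 1.0)


def url_common_ratio(page_url: str, image_src: str) -> float:
    """v(16 + i*4): share of the page URL that is common with the image URL.

    Assumption: "common" is read as the common leading prefix.
    """
    full_image_url = urljoin(page_url, image_src)  # rewrite to a full-path URL
    common = os.path.commonprefix([page_url, full_image_url])
    return len(common) / len(page_url)


page = "http://hogehoge.aaa.bbbbb.co.jp:8080/hoge1/hoge2/hoge.html"
ratio = url_common_ratio(page, "../image/hoge.gif")  # 43 / 58
```

The last two lines reproduce the example from the text: the page URL has 58 characters, 43 of which are shared with the resolved image URL.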

A cell position data storage part 103 serves to store the cell position data generated by the table analysis part 102. A cell vector storage part 104 serves to store the cell vectors generated by the table analysis part 102.

A table type judgement part 105 serves to judge the type of the table with reference to the cell position data stored in the cell position data storage part 103 and the cell vectors stored in the cell vector storage part 104, and to instruct a cut direction determination part 107 or a cell cluster generation part 111 to start processing depending on the table type. There are seven table types, table I to table VII, which are described below.
table I : the heights and widths of all of the cells are 1, and the cells in the first column/n-th row and the n-th column/first row are all <TH> or have the same background color.
table II : the heights and widths of all of the cells are 1, and the cells in the first column/n-th row and the n-th column/first row (except for the first column/first row) are all <TH> or have the same background color.
table III : the heights and widths of all of the cells are 1, and the cells in the first column/n-th row are all <TH> or have the same background color.
table IV : the heights and widths of all of the cells are 1, and the cells in the first column/n-th row (except for the first column/first row) are all <TH> or have the same background color.
table V : the heights and widths of all of the cells are 1, and the cells in the n-th column/first row are all <TH> or have the same background color.
table VI : the heights and widths of all of the cells are 1, and the cells in the n-th column/first row (except for the first column/first row) are all <TH> or have the same background color.
table VII : tables other than table I to table VI.

In the above, tables I to VI are tables that show a table as it is, and table VII is a table used as a tool for the purpose of layout. When the table type is any one of tables I to VI, the cut direction determination part 107 is instructed to start the processing, and when the table type is table VII, the cell cluster generation part 111 is instructed to start the processing.

A table type storage part 106 serves to store the table type determined by the table type judgement part 105.
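The seven-way classification can be sketched as below. This is a simplified sketch rather than the judgement part itself: the `Cell` type and `judge_table_type` function are hypothetical names, the "<TH> or same background color" test is assumed to have been reduced beforehand to a single boolean `item` per cell, the table is assumed to be rectangular, and a table with fewer than two rows or two columns is assumed to fall under table VII.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    item: bool        # True if the cell counts as an item-name cell (assumption)
    rowspan: int = 1  # height of the cell
    colspan: int = 1  # width of the cell


def judge_table_type(rows: List[List[Cell]]) -> str:
    """Return "I" to "VII" for a rectangular grid rows[row][column]."""
    if len(rows) < 2 or len(rows[0]) < 2:
        return "VII"  # assumption: too small to have item names and data
    if any(c.rowspan != 1 or c.colspan != 1 for row in rows for c in row):
        return "VII"  # every height and width must be 1 for tables I to VI
    corner = rows[0][0].item
    first_col = all(row[0].item for row in rows[1:])  # first column below the corner
    first_row = all(c.item for c in rows[0][1:])      # first row right of the corner
    if first_col and first_row:
        return "I" if corner else "II"
    if first_col:
        return "III" if corner else "IV"
    if first_row:
        return "V" if corner else "VI"
    return "VII"
```

In this sketch the odd-numbered types differ from the even-numbered ones only in whether the first column/first row cell is itself an item-name cell.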

When the cut direction determination part 107 is instructed to start the processing by the table type judgement part 105, the part 107 judges whether each set of data in the table describing a table is expressed by a column or by a row, with reference to the cell position data stored in the cell position data storage part 103 and the cell vectors stored in the cell vector storage part 104, thereby determining the table division direction.

A score S_h(T) for dividing a table T having N columns and M rows on the basis of columns, and a score S_v(T) for dividing the table T on the basis of rows, are defined as follows. In the following description, cos(v_{i,j}, v_{k,l}) represents the cosine between the cell vector v_{i,j} of the cell in the i-th column/j-th row and the cell vector v_{k,l} of the cell in the k-th column/l-th row.

However, this value is calculated only when data exist both in the cell in the i-th column/j-th row and in the cell in the k-th column/l-th row; if either cell has no data, the value is zero.

\[
\mathrm{exist}(i,j) =
\begin{cases}
1 & \text{(data exist in the cell in the } i\text{-th column/} j\text{-th row)} \\
0 & \text{(data do not exist in the cell in the } i\text{-th column/} j\text{-th row)}
\end{cases}
\]

\[
\mathrm{count}_h = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{M} \mathrm{exist}(i,j) \times \mathrm{exist}(i,k)
\]

\[
\mathrm{count}_v = \sum_{j=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{N} \mathrm{exist}(i,j) \times \mathrm{exist}(l,j)
\]

\[
S_h(T) = \frac{1}{\mathrm{count}_h} \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{M} \cos(v_{i,j}, v_{i,k})
\]

\[
S_v(T) = \frac{1}{\mathrm{count}_v} \sum_{j=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{N} \cos(v_{i,j}, v_{l,j})
\]

Since the dimension of a cell vector is determined by the number of images included in the cell, the cell vectors of the cells in the i-th column/j-th row and the k-th column/l-th row may differ in dimension. The cosine is therefore calculated after adding components having a value of zero to the lower-dimensional cell vector so that the dimensions of both vectors become the same.

S_h(T) is the average cosine between two cell vectors in the same column, and S_v(T) is the average cosine between two cell vectors in the same row. Since the cosine of two cell vectors can be regarded as the similarity of the cells, S_h(T) is the average similarity between the cells in the same segment when the table is divided column by column, and S_v(T) is the average similarity between the cells in the same segment when the table is divided row by row.

Since a segment should bring together various kinds of data, it is better that the similarity between the cells in the same segment is low. It is therefore judged that the table T should be divided column by column when S_h(T) < S_v(T), and row by row when S_h(T) > S_v(T).
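The score comparison can be sketched in Python as follows. The names `cosine`, `average_similarity` and `cut_direction` are hypothetical. The sketch assumes a rectangular grid indexed as grid[row][column], marks cells without data as None, and enumerates all ordered pairs within a column or row, including a cell paired with itself, which is one reading of the sums; excluding self-pairs would be an equally plausible variant. The text does not say what happens when both scores are equal, so that case is reported as "undetermined" here.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]
Grid = List[List[Optional[Vector]]]  # grid[row][column]; None marks a cell without data


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of two cell vectors, zero-padding the shorter one."""
    n = max(len(a), len(b))
    pa = list(a) + [0.0] * (n - len(a))
    pb = list(b) + [0.0] * (n - len(b))
    norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(y * y for y in pb))
    if norm == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return sum(x * y for x, y in zip(pa, pb)) / norm


def average_similarity(lines: List[List[Optional[Vector]]]) -> float:
    """Average cosine over all ordered pairs of data cells within each line."""
    total, count = 0.0, 0
    for line in lines:
        present = [v for v in line if v is not None]
        for a in present:
            for b in present:
                total += cosine(a, b)
                count += 1
    return total / count if count else 0.0


def cut_direction(grid: Grid) -> str:
    columns = [list(col) for col in zip(*grid)]
    s_h = average_similarity(columns)  # cells in the same column
    s_v = average_similarity(grid)     # cells in the same row
    if s_h < s_v:
        return "column"
    if s_h > s_v:
        return "row"
    return "undetermined"  # tie: not specified in the text
```

For a table whose rows are records (for example a name column next to a price column), cells in the same column resemble each other, so S_h(T) exceeds S_v(T) and the sketch returns "row".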

A cut direction storage part 108 serves to store the cut direction determined by the cut direction determination part 107.

A table segment generation part 109 serves to generate segments from the table describing a table with reference to the table type stored in the table type storage part 106 and the cut direction stored in the cut direction storage part 108. When the cut direction is the column direction, in a table of the table V type the columns are made into segments as they are, and in tables other than the table V type each segment is formed by combining a column with the first column. When the cut direction is the row direction, in a table of the table III type the rows are made into segments as they are, and in tables other than the table III type each segment is formed by combining a row with the first row.

A table segment storage part 110 serves to store the table segments generated by the table segment generation part 109.
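The segment generation rule can be sketched as below. The function name and the list-of-lines representation of a segment are hypothetical, and the sketch assumes that, when a first column (or first row) is combined with the others, it does not also form a segment by itself; the text does not settle that point.

```python
from typing import List

Line = List[str]       # one column or one row of cell contents
Segment = List[Line]   # a segment is a list of lines kept together


def generate_table_segments(rows: List[Line], table_type: str,
                            cut_direction: str) -> List[Segment]:
    """Cut a rectangular table rows[row][column] into segments."""
    if cut_direction == "column":
        lines = [list(col) for col in zip(*rows)]
        as_is_type = "V"    # columns become segments as they are
    elif cut_direction == "row":
        lines = [list(row) for row in rows]
        as_is_type = "III"  # rows become segments as they are
    else:
        raise ValueError("cut_direction must be 'column' or 'row'")
    if table_type == as_is_type:
        return [[line] for line in lines]
    first, rest = lines[0], lines[1:]
    # assumption: the first column/row is attached to every other one
    return [[first, line] for line in rest]
```

For example, a price list whose first row holds the item names and which is cut row by row yields one segment per data row, each carrying the item-name row with it.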

A cell cluster generation part 111 serves to cluster the cells in the table having the purpose of layout with reference to the cell vectors stored in the cell vector storage part 104 when the start of processing is instructed by the table type judgement part 105. Here, the grouping of the cells is determined by using the maximum distance algorithm. The clustering procedure of the maximum distance algorithm will now be described.

Step 1 : From a set of N sample patterns X = {x_1, x_2, ..., x_N}, any one sample (for example, x_1) is selected and made a cluster center z_1 ∈ Z.

Step 2 : For every x_i ∈ X not included in Z, the distance dx_i to the nearest cluster center among the cluster centers z_j ∈ Z is calculated. The x_i giving max{dx_i} is denoted x_c.

Step 3 : For every z_k ∈ Z, the distance dz_k to the furthest cluster center among the cluster centers other than z_k is calculated.

Step 4 : When dx_c > max{dz_k} × t (t = 0.5 to 1) holds, x_c is regarded as a new cluster center, and the algorithm returns to Step 2 to select the next cluster center. If dx_c < max{dz_k} × t (t = 0.5 to 1), the algorithm goes to Step 5.

Step 5 : Every x_i ∈ X is assigned to the cluster of the nearest z_j ∈ Z.

An example of a clustering result based on the maximum distance algorithm is shown in Fig. 4.
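Steps 1 to 5 can be sketched in Python as follows. The function name is hypothetical, Euclidean distance is assumed (the text does not name a metric), and all vectors are assumed to have been zero-padded to a common dimension. While only one cluster center exists, the distance of Step 3 is taken to be 0, which is an assumption made so that a second center can be selected.

```python
import math
from typing import List, Sequence, Tuple


def max_distance_clustering(samples: Sequence[Sequence[float]],
                            t: float = 0.75) -> Tuple[List[int], List[int]]:
    """Return (indices of the cluster centers, cluster label of each sample)."""
    if not samples:
        return [], []
    centers = [0]  # Step 1: take the first sample as the first cluster center
    while True:
        # Step 2: the non-center sample furthest from its nearest center
        candidate, candidate_dist = None, -1.0
        for i, x in enumerate(samples):
            if i in centers:
                continue
            d = min(math.dist(x, samples[c]) for c in centers)
            if d > candidate_dist:
                candidate, candidate_dist = i, d
        if candidate is None:
            break  # every sample is already a center
        # Step 3: for each center, the distance to the furthest other center
        dz = [max((math.dist(samples[k], samples[c]) for c in centers if c != k),
                  default=0.0)  # assumption: 0 while there is a single center
              for k in centers]
        # Step 4: accept the candidate as a new center or stop
        if candidate_dist > max(dz) * t:
            centers.append(candidate)
        else:
            break
    # Step 5: assign every sample to its nearest center
    labels = [min(range(len(centers)),
                  key=lambda j: math.dist(x, samples[centers[j]]))
              for x in samples]
    return centers, labels
```

With four points forming two well separated pairs, for instance (0, 0), (0.1, 0), (10, 0) and (10.1, 0), the sketch selects two centers and puts each pair into its own cluster.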

A cell cluster information storage part 112 serves to store the cell cluster information generated by the cell cluster generation part 111.

A layout segment generation part 113 serves to generate segments from the table having the purpose of layout with reference to the cell position data stored in the cell position data storage part 103 and the cell cluster information stored in the cell cluster information storage part 112.

The merit of arranging information in table form is that a certain arrangement pattern can easily be repeated vertically and horizontally. Thus, the arrangement pattern is inferred on the basis of the cell cluster information, and a segment is obtained by combining the cells that match the pattern, because, when a certain arrangement pattern appears repeatedly, the cells that match the pattern can be judged to be semantically similar. Details of the processing will now be described.

First of all, a fundamental cell kind is determined, and the cells of the fundamental cell kind are regarded as fundamental cells. The fundamental cell kind is the cell kind having the smallest number of cells among the cell kinds that include a plurality of cells of the same kind. If there are a plurality of such cell kinds, the leftmost or uppermost cell kind is selected.

Then, it is ascertained whether a cell of the same kind as a cell adjacent to a fundamental cell is adjacent to another fundamental cell. If so, each fundamental cell is connected with the adjacent cell to obtain a new fundamental cell. This procedure is repeated until no more cells can be connected.

When the above-mentioned process is finished, the fundamental cells and the remaining cells are each made into segments.

A layout segment storage part 114 serves to store the layout segments generated by the layout segment generation part 113. The table segments stored in the table segment storage part 110 and the layout segments stored in the layout segment storage part 114 are the segments eventually obtained.

Fig. 2 is a view showing a hardware arrangement of the document segmentation apparatus according to the illustrated embodiment.

In Fig. 2, a CPU 201 serves to effect the processing in accordance with a program stored in a ROM 202. The ROM 202 serves to store the program for performing the control procedure which will be described later. A RAM 203 serves to provide the storage areas required for operating the cell position data storage part 103, cell vector storage part 104, table type storage part 106, cut direction storage part 108, cell cluster information storage part 112 and the aforementioned program.

A disk drive device 204 serves to realize the HTML table storage part 101, table segment storage part 110, and layout segment storage part 114. A bus 205 serves to connect the above-mentioned elements and to permit sending and receiving of data between the elements.

Next, a processing operation of the illustrated embodiment will be explained. Fig. 3 is a flow chart showing an operation procedure of the document segmentation apparatus according to the illustrated embodiment.

In a step S301, the table stored in the HTML table storage part 101 is analyzed to generate the cell position data representing the positional relationship between the cells and the cell vectors representing characteristics of the cells. Then, the program goes to a step S302.

In the step S302, the table type is determined with reference to the cell position data stored in the cell position data storage part 103 and the cell vectors stored in the cell vector storage part 104. Then, the program goes to a step S303.

In the step S303, it is judged whether or not the table to be processed is a table describing a table, with reference to the table type stored in the table type storage part 106. If the table is a table describing a table, the program goes to a step S304. If not, the program goes to a step S306.

In the step S304, it is determined whether the data in the table describing a table are represented by column or by row with reference to the cell position data stored in the cell position data storage part 103 and the cell vectors stored in the cell vector storage part 104, thereby determining the dividing direction of the table. Then, the program goes to a step S305.

In the step S305, the segments are generated from the table showing a table as it is, with reference to the table type stored in the table type storage part 106 and the cut direction stored in the cut direction storage part 108. Then, the operation is finished.

In the step S306, the cells in the table used as a tool for the purpose of layout are clustered with reference to the cell vectors stored in the cell vector storage part 104. Then, the program goes to a step S307.

In the step S307, the segments are generated from the table used as a tool for the purpose of layout, with reference to the cell position data stored in the cell position data storage part 103 and the cell cluster information stored in the cell cluster information storage part 112. Then, the operation is finished.

As mentioned above, by analyzing the table to be processed to judge whether the table is a table describing a table or a table having the purpose of layout, and by generating the segments through the processing suited to the table concerned, the tables in the HTML document can be divided according to their contents.
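The branching of steps S301 to S307 can be sketched as a small dispatcher. All names here are hypothetical; the individual processing parts are passed in as callables so that the sketch shows only the control flow of Fig. 3 and makes no claim about how each part is implemented.

```python
from typing import Any, Callable, List, Tuple

TYPES_DESCRIBING_A_TABLE = {"I", "II", "III", "IV", "V", "VI"}


def segment_table(table: Any,
                  analyze: Callable[[Any], Tuple[Any, Any]],
                  judge_type: Callable[[Any, Any], str],
                  decide_cut: Callable[[Any, Any], str],
                  make_table_segments: Callable[[Any, str, str], List[Any]],
                  cluster_cells: Callable[[Any], Any],
                  make_layout_segments: Callable[[Any, Any], List[Any]]) -> List[Any]:
    positions, vectors = analyze(table)                     # S301
    table_type = judge_type(positions, vectors)             # S302
    if table_type in TYPES_DESCRIBING_A_TABLE:              # S303
        cut = decide_cut(positions, vectors)                # S304
        return make_table_segments(table, table_type, cut)  # S305
    clusters = cluster_cells(vectors)                       # S306
    return make_layout_segments(positions, clusters)        # S307
```

Passing the parts in as arguments mirrors the remark in the alterations that the elements need not reside in the same computer.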
[Alterations] 

In the above-mentioned embodiment, an example in which the maximum distance algorithm is used for clustering the cells was explained. However, the present invention is not limited to such an example, and the clustering may be effected by using other algorithms.

The definition of the components of the cell vectors shown in the illustrated embodiment is merely one example, and the characteristics of the cells may be expressed by other definitions.

The definition of the score for determining the cut direction shown in the illustrated embodiment is merely one example, and the cut direction may be determined by other definitions.

In the illustrated embodiment, an example in which the height and width of the cell, the kind of the tag (TH or TD) and the background color are used to find the column (or row) of item names for determining the table type was explained. However, the present invention is not limited to such an example, and the judgement may be effected by using other attributes.

In the illustrated embodiment, an example in which the cell position data storage part 103, cell vector storage part 104, table type storage part 106, cut direction storage part 108 and cell cluster information storage part 112 are realized by the RAM, and the HTML table storage part 101, table segment storage part 110 and layout segment storage part 114 are realized by the disk drive device, was explained. However, the present invention is not limited to such an example, and these parts may be realized by using any recording medium.

In the illustrated embodiment, an example in which an HTML table is divided was explained. However, so long as the contents of the table can be discriminated, another type of table may be divided.

In the illustrated embodiment, an example in which the elements are incorporated into the same computer was explained. However, the present invention is not limited to such an example, and the elements may be individually incorporated into computers or processing devices included in a network.

In the illustrated embodiment, an example in which the program is stored in the ROM was explained. However, the present invention is not limited to such an example, and the program may be stored in any recording medium. Further, the program may be realized by any circuit performing the same operation.
[Second Embodiment] 

In the above-mentioned embodiment, an example was
explained in which the apparatus serves to divide only
the HTML table; however, the present invention is not
limited to such an example. For example, the present
invention may be realized as an apparatus for dividing
the entire HTML document. Fig. 5 is a block diagram
showing a fundamental construction in such a case.

In Fig. 5, an HTML document storage part 501
serves to store an HTML document to be processed.
A normal segment generation part 502 serves to divide
the HTML document stored in the HTML document storage
part 501 into segments. A normal segment storage part
503 serves to store the segments other than tables
generated by the normal segment generation part 502.
An HTML table storage part 101 serves to store the
segments of tables generated by the normal segment
generation part 502. Others are the same as those
shown in Fig. 1.

In Fig. 5, the normal segments stored in the
normal segment storage part 503, the table segments
stored in the table segment storage part 110 and the
layout segments stored in the layout segment storage
part 114 are the segments eventually obtained.
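
The role of the normal segment generation part 502
can be illustrated by the following minimal sketch.
It is not the implementation of the embodiment: the
class name TableSplitter and its attributes are
hypothetical, only top-level TABLE elements are
separated, and the standard html.parser module of
Python is assumed as the HTML reader.

```python
from html.parser import HTMLParser


class TableSplitter(HTMLParser):
    """Separate top-level <table> markup from the rest of a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0              # current nesting depth of <table>
        self.normal_segments = []   # markup found outside any table
        self.table_segments = []    # markup of each top-level table
        self._buf = []

    def _flush(self, target):
        text = "".join(self._buf).strip()
        if text:
            target.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                # Everything buffered so far is a normal segment.
                self._flush(self.normal_segments)
            self.depth += 1
        self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._buf.append(f"</{tag}>")
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # A complete top-level table has been read.
                self._flush(self.table_segments)

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush(self.normal_segments)


splitter = TableSplitter()
splitter.feed("<p>intro</p><table><tr><td>a</td></tr></table><p>outro</p>")
splitter.close()
print(splitter.normal_segments)
print(splitter.table_segments)
```

In this sketch the table segments correspond to what
would be stored in the HTML table storage part 101 and
the remaining markup to what would be stored in the
normal segment storage part 503.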
[Third Embodiment]

In the above-mentioned embodiments, an example was
explained in which both the table showing the table as
it is and the table used as a tool for the purpose of
layout are divided into the segments; however, the
present invention is not limited to such an example.
For example, only the table showing the table as it is
may be divided. Fig. 6 is a block diagram showing a
fundamental construction in such a case.

In Fig. 6, when a table segment generation part
601 is instructed to start processing by a table type
judgement part 105, it generates table segments from
the HTML table stored in an HTML table storage part
101.

A table segment storage part 602 serves to store
the table segments generated by the table segment
generation part 601. Others are the same as those
shown in Fig. 1.

In Fig. 6, the table segments stored in the table 
segment storage part 110 and the table segments stored 
in the table segment storage part 602 are the segments 
eventually obtained. 

[Fourth Embodiment]

In the above-mentioned embodiments, an example was
explained in which both the table showing the table as
it is and the table used as a tool for the purpose of
layout are divided into the segments; however, only
the table used as a tool for the purpose of layout may
be divided. Fig. 7 is a block diagram showing a
fundamental construction in such a case.

In Fig. 7, when a table segment generation part
701 is instructed to start processing by a table type
judgement part 705, it generates table segments from
the HTML table stored in an HTML table storage part
101. A table segment storage part 702 serves to store
the table segments generated by the table segment
generation part 701. Others are the same as those
shown in Fig. 1.

In Fig. 7, the table segments stored in the table
segment storage part 702 and the layout segments
stored in the layout segment storage part 114 are the
segments eventually obtained.

Incidentally, in the above-mentioned embodiments,
an example was explained in which the present
invention is applied to the apparatus for dividing the
HTML document; however, the present invention is not
limited to such an example, and may be realized as a
segment retrieving apparatus in which retrieval can be
effected for each segment unit by combining the
dividing apparatus with a retrieving apparatus.

[Fifth Embodiment] 

In the above-mentioned embodiments, an example was
explained in which the judgement whether the table is
the table showing the table as it is or not is
effected only on the basis of the syntax of the table.

However, among the tables in HTML documents, there
are also tables in which the item names are neither
marked with TH tags nor written in emphasized
characters so as to permit discrimination as item
names, and thus a table describing the table may be
judged as layout. In such a case, the approach based
only on the syntax is limited in judging whether the
table is the table describing the table or not.

Now, referring to an example shown in Fig. 8,
since the meanings of the cells are analogous with
each other, it can be seen that each cell forms an
element of one item. In this way, among the tables in
HTML documents, there are also tables which can be
discriminated by semantics as the table showing the
table as it is.

Thus, in a fifth embodiment of the present 
invention, the judgement whether the table is the 
table describing the table or not is effected on the 
basis of approach from the semantics. 

Fig. 9 is a block diagram showing a construction
of an apparatus according to the fifth embodiment.

In a table analysis part 102, a table stored in
an HTML table storage part 101 is analyzed to generate
cell position data representing a positional
relationship between cells, cell vectors representing
characteristics of the cells and data for cells.
A cell data storage part 901 serves to store the cell
data generated by the table analysis part 102.
Others are the same as those shown in Fig. 1.

The processing procedure according to the
illustrated embodiment is effected in accordance with
the flow chart shown in Fig. 3, as in the first
embodiment. However, since there are slight
differences in detail from the first embodiment, such
differences will be described.

In a step S301, the table stored in the HTML table
storage part 101 is analyzed to generate cell position
data representing a positional relationship between
cells, cell vectors representing characteristics of
the cells and data for cells. Then, the program goes
to a step S302.

In the step S302, the table type is determined
with reference to the cell position data stored in the
cell position data storage part 103, the cell vectors
stored in the cell vector storage part 104 or the cell
data stored in the cell data storage part 901.
Then, the program goes to a step S303.

Here, the determination of the table type
includes determination of table type on the basis of
thesaurus, determination of table type on the basis of
similarity of character, determination of table type
on the basis of syntax and determination of table type
on the basis of coincidence of character.
An operation for determining the table type will be
described in connection with embodiments which will be
described later. The step S303 and other steps are
the same as those in the first embodiment.

In the illustrated embodiment, the table
judgement part 105 includes a thesaurus similarity
judgement part 1001 and a thesaurus dictionary 1002.
Now, an operation will be explained with reference to
Fig. 10.

The term "thesaurus" here refers to the high/low
rank relationships between words. For a given word,
there are high rank words which are more abstract,
synonyms which have the same meaning even if expressed
by another word, analogous words which resemble it in
meaning, and low rank words which are more concrete.
For example, the word "morning glory" has "flower" as
its high rank word and "violet", "convolvulus" and
"balsam" as analogous words. The word "flower" has
"violet", "convolvulus" and "balsam" as low rank
words.

The thesaurus similarity judgement part 1001
serves to judge the table type on the basis of the
thesaurus similarity described in the thesaurus
dictionary 1002 with reference to the cell position
data stored in the cell position data storage part 103
and the cell data stored in the cell data storage part
901, and the judged table type is stored in the table
type storage part 106.

Now, the judgement of the table type based on the
thesaurus similarity will be explained with reference
to an example of an M column/N row table.

A function for obtaining a score based on the
thesaurus for two character lines s1, s2 is expressed
as f(s1, s2). When the character line s2 is a synonym
or an analogous word of the character line s1, the
value of f(s1, s2) becomes maximum. It is assumed
that, as the character line s2 becomes gradually more
distant from the character line s1 in the high rank
word direction or the low rank word direction, the
value of f(s1, s2) becomes smaller.

When it is assumed that the character line of the
cell in the m-th column and n-th row is S_mn, the
average score of thesaurus for the cells in the first
row can be expressed as follows:

Similarly, the average score of thesaurus for the
cells in the first column can be expressed as follows:

If the average score of thesaurus for the cells in
the first column or row exceeds a threshold value, the
table is judged as the table describing the table,
and, if the average score does not exceed the
threshold value, it is judged as the table describing
the layout. In this way, the table type of the table
to be processed can be judged.
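
The threshold judgement described above can be
sketched as follows. This is only an illustration
under stated assumptions: the dictionary
TOY_HYPERNYMS is a hypothetical stand-in for the
thesaurus dictionary 1002, the concrete values
returned by f are arbitrary, the average is taken
over adjacent pairs of cells, the corner cell is
skipped, and the threshold of 0.5 is chosen only for
the example.

```python
# Hypothetical toy thesaurus: each word maps to its high rank word.
TOY_HYPERNYMS = {
    "violet": "flower",
    "morning glory": "flower",
    "balsam": "flower",
    "flower": "plant",
    "plant": None,
}


def ancestors(word):
    """Return [word, high rank word, next higher rank word, ...]."""
    chain = []
    while word is not None:
        chain.append(word)
        word = TOY_HYPERNYMS.get(word)
    return chain


def f(s1, s2):
    """Thesaurus score: maximum for synonyms and analogous words,
    smaller as s2 moves away from s1 along the hierarchy."""
    w1, w2 = s1.strip().lower(), s2.strip().lower()
    if w1 not in TOY_HYPERNYMS or w2 not in TOY_HYPERNYMS:
        return 0.0
    if w1 == w2:
        return 1.0
    p1, p2 = TOY_HYPERNYMS[w1], TOY_HYPERNYMS[w2]
    if p1 is not None and p1 == p2:      # analogous words
        return 1.0
    a1, a2 = ancestors(w1), ancestors(w2)
    if w2 in a1:                         # s2 is a high rank word of s1
        return 1.0 / (1 + a1.index(w2))
    if w1 in a2:                         # s2 is a low rank word of s1
        return 1.0 / (1 + a2.index(w1))
    return 0.0


def average_score(cells, score):
    """Average of score over adjacent pairs of cells (an assumption)."""
    pairs = list(zip(cells, cells[1:]))
    if not pairs:
        return 0.0
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def judge_by_thesaurus(table, threshold=0.5):
    """Return 'table' or 'layout' for a table given as a list of rows."""
    first_row = table[0][1:]
    first_col = [row[0] for row in table[1:]]
    best = max(average_score(first_row, f), average_score(first_col, f))
    return "table" if best > threshold else "layout"


flowers = [
    ["", "water", "sunlight"],
    ["violet", "daily", "shade"],
    ["morning glory", "daily", "full"],
    ["balsam", "often", "full"],
]
print(judge_by_thesaurus(flowers))
```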

As a method for obtaining the score based on the
similarity of character regarding two character lines
s1, s2, there is a method called "vague retrieval".

A function for obtaining a score based on the
similarity of character for two character lines s1, s2
is expressed as g(s1, s2). It is assumed that the
value of g(s1, s2) becomes greater if the similarity
of character is great and becomes smaller if the
similarity of character is small. By using the vague
retrieval, similarly to the method for obtaining the
score on the basis of the thesaurus, if the average
score of similarity of character for the cells in the
first column or row exceeds a threshold value, the
table is judged as the table describing the table,
and, if the average score does not exceed the
threshold value, it is judged as the table describing
the layout. In this way, the table type of the table
to be processed can be judged.
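
As an illustration of the function g, the following
sketch uses the SequenceMatcher class of the standard
Python difflib module as a stand-in for the vague
retrieval; the embodiment does not specify this
particular measure. The averaging over adjacent cells
and the threshold of 0.6 are likewise assumptions made
only for the example.

```python
from difflib import SequenceMatcher


def g(s1, s2):
    """Character similarity in [0, 1]; greater means more similar."""
    return SequenceMatcher(None, s1, s2).ratio()


def average_similarity(cells):
    """Average of g over adjacent pairs of cells (an assumption)."""
    pairs = list(zip(cells, cells[1:]))
    if not pairs:
        return 0.0
    return sum(g(a, b) for a, b in pairs) / len(pairs)


def judge_by_character_similarity(cells, threshold=0.6):
    """Judge the cells of a first column or first row."""
    return "table" if average_similarity(cells) > threshold else "layout"


print(judge_by_character_similarity(["AAA0001", "AAA0002", "AAA1001"]))
print(judge_by_character_similarity(["Home", "Weather today", "Stocks"]))
```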

In the illustrated embodiment, regarding the
table to be processed, first of all, the judgement of
the table based on the thesaurus is effected, and, if
the table is the table describing the table, the
procedure is ended, and, if the table is not the table
describing the table, the table judgement based on the
similarity of character is effected regarding the
table to be processed.

In this way, the table type of the table to be
processed can be judged on the basis of the
thesaurus similarity.

Now, the details of the table judgement in the
step S302 will be explained with reference to Fig. 11.

In a step S1101, from the cell position data
stored in the cell position data storage part 103 and
the cell data stored in the cell data storage part
901, the type of the table to be processed is judged
on the basis of the thesaurus, and, if the table is
the table describing the table, the procedure is
ended, and, if the table is not the table describing
the table, the program goes to a step S1102.

In the step S1102, from the cell position data
and the cell data, the type of the table to be
processed is judged on the basis of the similarity of
character. Then, the procedure is ended.

Here, an example of the table of the page
regarding "How to Rear Flowers" shown in Fig. 8 will
be explained. First of all, the average scores of
thesaurus for the cells in the first column and the
first row are measured. In the first column, it can
be seen that the words "violet", "morning glory" and
"balsam" are included. These are all words regarding
flowers. Accordingly, the average score of thesaurus
regarding the cells in the first column becomes great,
and, thus, this table can be judged as the table
describing the table.

Next, an example of a table regarding "A Page of
Products Catalog" shown in Fig. 12 will be explained.
First of all, the average scores of similarity of
character for the cells in the first column and the
first row are measured. In the first column, it can
be seen that the character lines "AAA0001", "AAA0002"
and "AAA1001" are included. These character lines
resemble one another. Accordingly, the average score
of similarity of character regarding the cells in the
first column becomes great, and, thus, this table can
be judged as the table describing the table.

As mentioned above, by analyzing the table to be
processed on the basis of the semantics to judge
whether the table is the table describing the table or
the table having purpose of layout and by generating
the segments accordingly, the table in the HTML
document can be divided from content to content.

[Sixth Embodiment]

In a sixth embodiment of the present invention, a
table judgement portion 105 includes a partial
character line extracting part 1301 and a character
line comparison part 1302. An operation will be
explained with reference to Fig. 13.

In the partial character line extracting part
1301, partial character lines of the cells are
extracted with reference to the cell position data
stored in the cell position data storage part 103 and
the cell data stored in the cell data storage part
901. The extraction of the partial character line is
effected by using a known method such as geometric
element analysis.

In the character line comparison part 1302, the
partial character lines of the cells extracted in the
partial character line extracting part 1301 are
compared, so that the table type is judged depending
upon whether the character lines coincide with each
other in many cells or not. The judged table type is
stored in the table type storage part 106.

Now, the judgement of the table type based on the
character line comparison will be explained with
reference to an example of an M column/N row table.

A function for obtaining coincidence of character
line regarding two character lines s1, s2 is expressed
as h(s1, s2). It is assumed that, if h(s1, s2) ≠ 0,
the two character lines do not coincide, and, if
h(s1, s2) = 0, the two character lines coincide with
each other.

When it is assumed that the character line of the
cell in the m-th column and n-th row is S_mn and the
k-th partial character line from the top when S_mn is
divided into partial character lines is s^k_mn, the
average of coincidence of character line regarding the
last partial character lines of the cells in the first
column can be expressed as follows:

Here, the superscripted terms in the expression
represent the last partial character lines in the
respective character lines. Similarly, the average of
coincidence of character line regarding the last
partial character lines of the cells in the first row
can be expressed as follows:

If the average of coincidence of character line
regarding the cells in the first column or the first
row does not exceed a threshold value, the table is
judged as the table describing the table, and, if the
average exceeds the threshold value, it is judged as
the table describing the layout. In this way, the
type of the table to be processed can be judged.
After the processing, the judged table type is stored
in the table type storage part 106. In this way, the
table type can be judged on the basis of the character
line comparison.

Now, the details of the table judgement in the
step S302 will be explained with reference to Fig. 14.

In a step S1401, from the cell position data and
the cell vectors, the partial character lines are
extracted. Then, the program goes to a step S1402.

In the step S1402, the partial character lines of
the cells are compared, and the table type is judged
depending upon whether the character lines coincide
with each other in many cells or not.
Then, the procedure is ended.

Now, an example of a table regarding "A Page of
Medical Centers" shown in Fig. 15 will be explained.

First of all, the cells in the first column and
the first row are divided into partial character lines
by using the geometric element analysis. When the
cells in the first row are divided into the partial
character lines, "A clinic", "B clinic" and "C clinic"
are obtained. When the character line comparison is
effected between the last partial character lines of
the cells, since "clinic" coincides, the average of
coincidence of character line regarding the cells in
the first row becomes small, and, thus, it can be
judged as the table describing the table.
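
The comparison of the last partial character lines
can be sketched as follows. Splitting on white space
is used here merely as a hypothetical stand-in for the
geometric element analysis, and the averaging over
adjacent cells and the threshold of 0.5 are
assumptions made for the example. Following the
convention above, h returns 0 when the two character
lines coincide.

```python
def last_partial_line(cell_text):
    """Last partial character line of a cell (white-space split is an
    assumption standing in for the analysis named in the text)."""
    parts = cell_text.split()
    return parts[-1] if parts else ""


def h(s1, s2):
    """Coincidence function: 0 if the lines coincide, 1 otherwise."""
    return 0 if s1 == s2 else 1


def average_coincidence(cells):
    """Average of h over the last partial lines of adjacent cells."""
    lasts = [last_partial_line(c) for c in cells]
    pairs = list(zip(lasts, lasts[1:]))
    if not pairs:
        return 1.0          # nothing to compare: treat as no coincidence
    return sum(h(a, b) for a, b in pairs) / len(pairs)


def judge_by_coincidence(cells, threshold=0.5):
    """'table' when the average does not exceed the threshold."""
    return "table" if average_coincidence(cells) <= threshold else "layout"


print(judge_by_coincidence(["A clinic", "B clinic", "C clinic"]))
print(judge_by_coincidence(["Home", "News today", "Contact us"]))
```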

As mentioned above, by analyzing the coincidence
of partial character line of cells to judge whether
the table to be processed is the table showing the
table or the table having purpose of layout and by
generating the segments accordingly, the table in the
HTML document can be divided from content to content.
[Seventh Embodiment] 

In a seventh embodiment of the present invention,
a table judgement portion 105 includes a partial
character line extracting part 1601, a thesaurus
similarity judgement part 1602, and a thesaurus
dictionary 1603. An operation will be described with
reference to Fig. 16.

In the partial character line extracting part
1601, partial character lines are extracted with
reference to the cell position data stored in the cell
position data storage part 103 and the cell data
stored in the cell data storage part 901.

In the thesaurus similarity judgement part 1602,
regarding the partial character lines of the cells
extracted in the partial character line extracting
part 1601, the table type is judged on the basis of
the thesaurus similarity of the thesaurus dictionary
1603, and the judged table type is stored in the table
type storage part 106.

Now, the details of the table judgement in the
step S302 will be explained with reference to Fig. 17.

In a step S1701, from the cell position data and
the cell vectors, the partial character lines are
extracted. Then, the program goes to a step S1702.

In the step S1702, regarding the partial
character lines of the cells, the table judgement
based on thesaurus is effected. As a result, in a
step S1703, if the table is the table describing the
table, the procedure is ended; otherwise, the program
goes to a step S1704.

In the step S1704, regarding the partial
character lines of the cells, the table judgement
based on similarity of character is effected.
Then, the procedure is ended.

As mentioned above, by judging the table type of
the table to be processed on the basis of the
thesaurus similarity regarding the partial character
lines of the cells to judge whether the table is the
table showing the table or the table having purpose of
layout and by generating the segments accordingly, the
table in the HTML document can be divided from content
to content.
[Eighth Embodiment] 
In an eighth embodiment of the present invention,
a table judgement portion 105 includes a syntax
judgement part 1801, a thesaurus similarity judgement
part 1802, and a thesaurus dictionary 1803. An
operation will be described with reference to Fig. 18.

The syntax judgement part 1801 serves to effect
the processing similar to the table type judgement
part 105 of the first embodiment. After the
processing in the syntax judgement part 1801 or the
thesaurus similarity judgement part 1802, the judged
table type is stored in the table type storage part
106.

Now, the details of the table judgement in the 
step S302 will be explained with reference to Fig. 19. 

In a step S1901, from the cell position data and
the cell vectors, the table type is judged on the
basis of syntax. As a result, in a step S1902, if the
table is the table describing the table, the procedure
is ended; otherwise, the program goes to a step S1903.

In the step S1903, from the cell position data
and the cell vectors, the table type is judged on the
basis of thesaurus. As a result, in a step S1904, if
the table is the table describing the table, the
procedure is ended; otherwise, the program goes to a
step S1905.

In the step S1905, from the cell position data
and the cell vectors, the table type is judged on the
basis of similarity of character. Then, the procedure
is ended.
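
The order of the steps S1901 to S1905 can be
summarized by the following sketch, in which each
judgement is tried in turn and the procedure ends as
soon as one of them judges the table to be the table
describing the table. The three judgement functions
are deliberately simplified, hypothetical stand-ins
(a TH-tag check, a fixed word set in place of the
thesaurus dictionary 1803, and difflib in place of the
vague retrieval); only the control flow is taken from
the text.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for one category of a thesaurus dictionary.
FLOWERS = {"violet", "morning glory", "balsam"}


def first_column(table):
    """Texts of the first column; each cell is a (tag, text) pair."""
    return [row[0][1] for row in table]


def by_syntax(table):
    # Simplified: the first row consists only of TH cells.
    return all(tag == "th" for tag, _ in table[0])


def by_thesaurus(table):
    words = [w.lower() for w in first_column(table)]
    return len(words) >= 2 and all(w in FLOWERS for w in words)


def by_character_similarity(table, threshold=0.6):
    cells = first_column(table)
    pairs = list(zip(cells, cells[1:]))
    if not pairs:
        return False
    avg = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    return avg > threshold


def judge_table_type(table):
    """Return (type, deciding stage) following the order S1901 to S1905."""
    stages = [
        ("syntax", by_syntax),
        ("thesaurus", by_thesaurus),
        ("character", by_character_similarity),
    ]
    for name, judge in stages:
        if judge(table):
            return "table", name
    return "layout", None


catalog = [[("td", "AAA0001"), ("td", "pen")],
           [("td", "AAA0002"), ("td", "ink")]]
print(judge_table_type(catalog))
```

Placing the syntax judgement first lets the later
judgements be skipped whenever the markup alone is
decisive.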

As mentioned above, by analyzing the table type
of the table to be processed on the basis of syntax
and semantics to judge whether the table is the table
showing the table or the table having purpose of
layout and by generating the segments accordingly, the
table in the HTML document can be divided from content
to content.
[Ninth Embodiment] 

In a ninth embodiment of the present invention, a
table judgement portion 105 includes a syntax
judgement part 2001, a partial character line
extracting part 2002, and a character line comparison
part 2003. An operation will be described with
reference to Fig. 20.

The syntax judgement part 2001 serves to effect
the processing similar to the table type judgement
part 105 of the first embodiment. The partial
character line extracting part 2002 and the character
line comparison part 2003 serve to effect the
processing similar to the partial character line
extracting part 1301 and the character line comparison
part 1302 of the sixth embodiment. After the
processing in the syntax judgement part 2001 or the
character line comparison part 2003, the judged table
type is stored in the table type storage part 106.

Now, the details of the table judgement in the
step S302 will be explained with reference to Fig. 21.

In a step S2101, from the cell position data and
the cell vectors, the table type is judged on the
basis of syntax. As a result, if the table is the
table describing the table, the procedure is ended;
otherwise, the program goes to a step S2102.

In the step S2102, from the cell position data
and the cell vectors, the partial character lines are
extracted, and, in a step S2103, the partial character
lines of the cells are compared, so that the table
type is judged depending upon whether the partial
character lines coincide with each other in many
cells or not. Then, the procedure is ended.

As mentioned above, by analyzing the table type
of the table to be processed on the basis of syntax
and the coincidence of the partial character line to
judge whether the table is the table describing the
table or the table having purpose of layout and by
generating the segments accordingly, the table in the
HTML document can be divided from content to content.
[Tenth Embodiment] 

In a tenth embodiment of the present invention, a
table judgement portion 105 includes a syntax
judgement part 2201, a partial character line
extracting part 2202, a thesaurus similarity judgement
part 2203 and a thesaurus dictionary. An operation
will be described with reference to Fig. 22.

The syntax judgement part 2201 serves to effect
the processing similar to the table type judgement
part 105 of the first embodiment. The partial
character line extracting part 2202 and the thesaurus
similarity judgement part 2203 serve to effect the
processing similar to the partial character line
extracting part 1601 and the thesaurus similarity
judgement part 1602 of the seventh embodiment. After
the processing in the syntax judgement part 2201 or
the thesaurus similarity judgement part 2203, the
judged table type is stored in the table type storage
part 106.

Now, the details of the table judgement in the
step S302 will be explained with reference to Fig. 23.

In a step S2301, from the cell position data and
the cell vectors, the table type is judged on the
basis of syntax. As a result, in a step S2302, if the
table is the table describing the table, the procedure
is ended; otherwise, the program goes to a step S2303.

In the step S2303, from the cell position data
and the cell vectors, the partial character lines are
extracted, and, in a step S2304, regarding the partial
character lines of the cells, the table judgement is
effected on the basis of thesaurus. As a result, in a
step S2305, if the table is the table describing the
table, the procedure is ended; otherwise, the program
goes to a step S2306.

In the step S2306, regarding the partial
character lines of the cells, the table judgement is
effected on the basis of similarity of character.
Then, the procedure is ended.

As mentioned above, by analyzing the table type
of the table to be processed on the basis of syntax
and analyzing the partial character lines of the cells
to judge whether the table is the table describing the
table or the table having purpose of layout and by
generating the segments accordingly, the table in the
HTML document can be divided from content to content.

In the above-mentioned embodiments, when judging
whether the table is the table describing the table or
not, by utilizing the table judgement based on
semantics as well as the table judgement based on
syntax, it is possible to make this judgement for many
tables.
[Eleventh Embodiment] 

Now, naming regarding the table will be briefly
described.

"Record" is information representing one
substance, and a group of records representing similar
substances constitutes a record concurrence. Of
course, the styles of the records in the record
concurrence are the same. The record is constituted
by fields (data) representing attributes of the
substance.

For example, "Taro Yamada: Yokohama-city:
045-000-0000" is a record constituted by three fields.
"Hanako Yamada: Kawasaki-city: 044-111-1111" is also a
record representing a person in the same manner as the
above record. The concurrence constituted by these
two records is a record concurrence.

In order to discriminate the fields, since
designations such as first field and second field are
difficult to understand, naming is frequently used.
The name or title given to the field is called a field
name. For example, in the aforementioned record, it is
assumed that the field name of the first field is
"(person's) name", that of the second field is
"address" and that of the third field is "phone
number". Thus, in the first record, the field value
of the field name "name" is "Taro Yamada" and the
field value of the field name "address" is
"Yokohama-city".

Data actually representing the record concurrence
is shown in Fig. 24. In case of the HTML document,
the table is concretely described as a table (a table
is data described by TABLE tags). Fig. 24 shows an
example of the record concurrence described by the
table.

In this example, each column of the table
describes one record, but there are also cases where
the rows describe the records. However, since the
column and the row may be interchanged, i.e., the
table may be transposed about its diagonal, the
following explanation assumes that the records are
described in the column direction. In the other case,
the same result is achieved by interchanging the
readings of column and row. In the table shown, the
first line describes the field names of the fields.
Such a line is referred to as a field name describing
line (i.e., line with the field name). The second and
third lines each describe one record. Such a line is
referred to as a record describing line (i.e., line
with record).
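
The naming above can be illustrated with the
following sketch, which reads a table whose first line
is the field name describing line and whose remaining
lines are record describing lines, and which
interchanges lines in the two directions when the
records run the other way. The function names
transpose and to_records are hypothetical; the sample
values are those of the example records above.

```python
def transpose(table):
    """Interchange the two directions of a regular table."""
    return [list(line) for line in zip(*table)]


def to_records(table):
    """Pair each record describing line with the field names."""
    field_names, *record_lines = table
    return [dict(zip(field_names, line)) for line in record_lines]


table = [
    ["name", "address", "phone number"],            # field name line
    ["Taro Yamada", "Yokohama-city", "045-000-0000"],
    ["Hanako Yamada", "Kawasaki-city", "044-111-1111"],
]

records = to_records(table)
print(records[0]["address"])

# A table written in the other direction gives the same records
# once it has been transposed.
assert to_records(transpose(transpose(table))) == records
```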

In the aforementioned embodiments, in order to
judge whether the table is the table describing the
table or not, the judgement was effected under an
assumption of a table in which none of the M columns
and N rows is omitted and regular description is made.
However, among the tables in the HTML document, there
are tables in which a plurality of tables are included
in one table or in which a record straddles plural
tables. Further, there are also multi-row and
multi-column tables in which, when adjacent pieces of
information are the same, they are gathered and
described as a single piece of information.
Regarding such tables, the table judgement cannot be
effected easily.

For these tables, by analyzing a structure of the
table and regularity of information description
constituting the table, and by reforming the table
regularly in M columns and N rows, the table can be
divided correctly.
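
As one illustration of reforming a table regularly,
the following sketch expands a multi-row/multi-column
table so that a cell spanning several rows or columns
is repeated in every position it covers. This is a
generic technique shown under assumptions, not the
procedure of the embodiment: the cell representation
(text, rowspan, colspan) and the function name
expand_spans are hypothetical.

```python
def expand_spans(rows):
    """Return a regular grid from rows of (text, rowspan, colspan) cells."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already occupied by a cell spanning from above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]


rows = [
    [("Tokyo", 2, 1), ("A clinic", 1, 1)],   # "Tokyo" spans two rows
    [("B clinic", 1, 1)],
]
for line in expand_spans(rows):
    print(line)
```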

Fig. 25 is a block diagram showing a construction
of an apparatus according to an embodiment of the
present invention.

An HTML table reformation part 2501 serves to
reform a table stored in the HTML table storage part
101 regularly without omission of M columns and N rows
by analyzing the structure of the table and regularity
of information description constituting the table.

An HTML table reformation data storage part 2502
serves to store data of the HTML table reformed in the
HTML table reformation part 2501.

A table analysis part 102 serves to analyze the
table stored in the HTML table reformation data
storage part 2502 thereby to generate cell position
data indicating a positional relationship between the
cells, cell vectors representing characteristics of
the cells and data of the cells. The other
constructions are the same as those shown in Fig. 1.

Next, an operation of the document dividing
apparatus according to the illustrated embodiment will
be explained with reference to a flow chart shown in
Fig. 26.

In a step S2600, regarding the table stored in
the HTML table storage part 101, by analyzing the
structure of the table and regularity of information
description constituting the table, the table is
reformed regularly without omission of M columns and N
rows. Then, the program goes to a step S2601.

The table reformation includes table reformations
based on supplementary data removal, treatment of a
multi-row/multi-column table and treatment of a
composite table. In the illustrated embodiment, the
table reformation is effected by the supplementary
data removal. The table reformations based on the
treatment of a multi-row/multi-column table and
treatment of a composite table will be described in
connection with other embodiments. Steps S2601 to
S2607 are the same as the steps S301 to S307 in
Fig. 3.

In the illustrated embodiment, the supplementary
data removal is effected by the HTML table reformation
part 2501. Here, referring to the table data stored
in the HTML table storage part 101, unnecessary data
added inside the TABLE tags besides the table itself
is removed.

Next, the details of the HTML table reformation
in the step S2600 will be explained with reference to
Fig. 27.

In a step S2701, a region of the field name
describing line (line with field name) marked with TH
tags is judged; in a step S2702, a region of the
field name describing line marked with tags describing
the background color is judged; and, in a step S2703,
a region of the field name describing line marked with
tags for bold face is judged. The program then goes to
a step S2704.

In the step S2704, on the basis of the regions of
the lines with the field name checked in the steps
S2701 to S2703, meaning similarity between the field
names of the lines with the field name and the fields
perpendicular to the describing directions of the
lines with the field name is calculated. Since a
field having a high similarity score is a description
belonging to the field name, the region in the table is
judged by finding the region whose fields have high
similarity scores. In a step S2705, the similarity of
character lines is calculated in the same procedure as
in the step S2704 to judge the region in the table.

In a step S2706, on the basis of the regions in 
the table checked in the steps S2704 to S2705, 
excessive data other than the table is removed. 

Now, the operation for the supplementary data
removal will be described by using a sample. Fig. 28
shows a page of "How to Rear Flowers", in which the
supplementary data other than the table are added to
the first and fourth columns.

First of all, in the steps S2701 to S2703, the
lines with the field name are specified. In Fig. 28,
since the field names in the second line are described
in bold face, it is judged by the processing in the
step S2703 that the second line is the field name
describing line.

Then, in the steps S2704 and S2705, the region in
the table, i.e., a range of the field values regarding
the field name, is specified on the basis of the
similarity of thesaurus or similarity of character
line. In Fig. 28, since "violet", "morning glory" and
"balsam", which are the field values regarding the
field name "flower name", are described in the third
to fifth lines of the first column, it is judged by
the processing in the step S2704 that the table has
the region corresponding to the second to fifth lines.

Lastly, by the processing in the step S2706, the
supplementary data outside the region in the table is
removed, so that the contents of the table can be
picked up.
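
The flow of the steps S2701 to S2706 can be illustrated by the following minimal sketch. It is not the implementation of the embodiment: only the bold-face check of the step S2703 is modeled, the thesaurus similarity of the step S2704 is replaced by a lookup in a hypothetical dictionary `THESAURUS`, and the sample rows are invented for illustration (only the flower names and the field name "flower name" come from the description of Fig. 28).

```python
# Hypothetical thesaurus: field name -> words related in meaning.
# Its contents are an assumption made for this sketch only.
THESAURUS = {"flower name": {"violet", "morning glory", "balsam"}}


def cell(text, bold=False):
    """Build one table cell; `bold` mimics a bold-face tag around the text."""
    return {"text": text, "bold": bold}


def find_field_name_line(rows):
    """Bold-face check only (cf. step S2703): first line whose cells are all bold."""
    for index, row in enumerate(rows):
        if row and all(c["bold"] for c in row):
            return index
    return None


def is_similar(field_name, value):
    """Stand-in for step S2704: thesaurus lookup instead of a similarity score."""
    return value.lower() in THESAURUS.get(field_name.lower(), set())


def remove_supplementary_data(rows):
    """Cf. step S2706: keep the field-name line and the lines judged to be in the table."""
    start = find_field_name_line(rows)
    if start is None:
        return rows
    field_name = rows[start][0]["text"]
    end = start
    for index in range(start + 1, len(rows)):
        if rows[index] and is_similar(field_name, rows[index][0]["text"]):
            end = index
        else:
            break
    return rows[start:end + 1]


page = [
    [cell("How to Rear Flowers")],                          # supplementary line
    [cell("flower name", True), cell("watering", True)],    # field-name line
    [cell("violet"), cell("every day")],
    [cell("morning glory"), cell("twice a day")],
    [cell("balsam"), cell("every day")],
    [cell("Last updated in spring")],                       # supplementary line
]
table = remove_supplementary_data(page)
```

In this sketch, `table` holds the second to fifth lines of `page`, which corresponds to the region judged in the sample above.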

As mentioned above, regarding the table to be
processed, by analyzing the structure of the table and
regularity of information description constituting the
table and by reforming the table regularly in M
columns and N rows, the table can be divided
correctly.

[Twelfth Embodiment] 

In a twelfth embodiment of the present invention,
the HTML table reformation part 2501 effects the
multi-row/multi-column table treatment. Here, by
analyzing the structure of the table with reference to
the table data stored in the HTML table storage part
101, the table is reformed regularly without omission
of M columns and N rows.

Next, the details of the HTML table reformation
in the step S2600 will be explained with reference to
Figs. 29A to 29E.

When the multi-row/multi-column tables are sorted
into groups of similar tables, (1) the record can be
picked up by making the structure of the field of the
line with the field name correspond to the structure
of the field of the record portion, (2) the record can
be picked up by matching the structure of the field of
the field name describing line with the structure of
the field of the record, and (3) the record can be
picked up by re-reading the field portion including
the multi-row/multi-column. A flow of the process
regarding (1) is shown in Figs. 29A to 29C, a flow of
the process regarding (2) is shown in Fig. 29D and a
flow of the process regarding (3) is shown in Fig.
29E.

When the data in the table including the
multi-row/multi-column is handled, the field of the
multi-row or multi-column is divided into minimum unit
fields which are in turn stored. In this case, the
same data of the multi-row/multi-column field is
stored in each of the respective fields at the stage
of division. For example, the multi-row/multi-column
shown in Fig. 30A is divided into the minimum unit
fields which are in turn stored. Thus, as shown in
Fig. 30B, a table having four columns and four rows
is obtained.
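
The division into minimum unit fields can be sketched as follows. This is an illustrative assumption about the data representation, not the embodiment itself: each cell is given as a hypothetical tuple `(text, rowspan, colspan)`, and the sample contents are invented.

```python
def expand_spans(rows):
    """Divide multi-row/multi-column cells into minimum unit fields.

    rows: list of lines; each cell is a tuple (text, rowspan, colspan).
    The same text is stored in every unit field covered by the cell.
    """
    grid = {}  # (line index, column index) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:          # skip fields filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


source = [
    [("name", 2, 1), ("price", 1, 2), ("note", 2, 1)],
    [("adult", 1, 1), ("child", 1, 1)],
    [("A", 1, 1), ("10", 1, 1), ("5", 1, 1), ("x", 1, 1)],
    [("B", 1, 1), ("20", 1, 1), ("8", 1, 1), ("y", 1, 1)],
]
unit_fields = expand_spans(source)
```

With this input, `unit_fields` is a regular table of four columns and four rows in which "name", "price" and "note" each appear twice.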

In the above (1), the record is picked up by making
the structure of the field of the line with the field
name correspond to the structure of the field of the
record portion.

First of all, a process for analyzing the
structure of the field of the line with the field name
will be explained with reference to Fig. 29A.

In a step S2901, if the field exists, the program
goes to a step S2902. If the field does not exist,
the processing of the multi-row/multi-column is ended.

In the step S2902, data of a line is extracted,
and, in a step S2903, a region of lines with the field
name is judged, and then the program goes to a step
S2904. The region of lines with the field name can be
judged by examining, column by column, whether the
fields in the line presently stored differ from the
fields in the immediately previous line.

For example, in the multi-row/multi-column as
shown in Fig. 30C, since the data is stored by
dividing into the minimum unit fields, the table
having four columns and four rows is obtained as shown
in Fig. 30D. Here, when the fields in the first and
second lines are examined for the same data, since the
fields coincide in the first and fourth columns, the
border between the first and second lines is not the
border of the lines with the field name. However,
when the fields in the second and third lines are
examined for the same data, since no fields coincide,
the border between the second and third lines becomes
the border of the lines with the field name. In this
way, the structure of the lines with the field name
can be grasped.
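
The border judgment described above can be sketched as follows, assuming the table has already been divided into minimum unit fields. The function name and the sample data are hypothetical.

```python
def count_field_name_lines(grid):
    """Return the number of leading lines that form the lines with the field name.

    The border is the first pair of adjacent lines in which no column
    holds the same data. If no such border is found, every line is
    counted (an arbitrary choice made for this sketch).
    """
    for i in range(len(grid) - 1):
        upper, lower = grid[i], grid[i + 1]
        if not any(a == b for a, b in zip(upper, lower)):
            return i + 1
    return len(grid)


grid = [
    ["name", "price", "price", "note"],   # first line
    ["name", "adult", "child", "note"],   # coincides in first and fourth columns
    ["A", "10", "5", "x"],                # no column coincides with the line above
    ["B", "20", "8", "y"],
]
field_name_lines = count_field_name_lines(grid)
```

Here the first and second lines coincide in the first and fourth columns, while the second and third lines do not coincide anywhere, so the sketch reports two lines with the field name.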

In the step S2904, if the structure of the lines
with the field name can be grasped, the program goes
to (1). If not grasped, in a step S2905, data of one
line is stored, and, in a step S2906, it is examined
which structures are given in the fields with the
field name in the lines which have been examined up
to now, and the program is returned to the step S2901.

Next, the processing for picking up the records
on the basis of the analyzed structures of the fields
with the field name will be explained with reference
to Fig. 29B. Here, the records can be picked up from a
table in which the structure of the fields of the lines
with the field name is the same as the structure of the
fields of the records, as shown in Fig. 30E. Further,
the processing is started from the field in the first
record.

In a step S2907, if the field exists, the program
goes to a step S2908. If the field does not exist, the
program goes to a step S2910. However, if no field
exists at all, the processing of the
multi-row/multi-column is ended.

In the step S2908, the data of one line is
extracted, and, in a step S2909, if the structure of
the field of the line with the field name coincides
with the structure of one record, the program is
returned to the step S2907. If the structures do not
coincide, the program goes to (2).

In the step S2910, on the basis of the structure 
of the field of the line with the field name, the 
field information is reformed. 

Next, the processing for picking up the record on
the basis of the analyzed structure of the field of
the line with the field name will be further explained
with reference to Fig. 29C. Here, by using the
structure of the field having the field value as shown
in Fig. 30F, the record can be picked up from a table
in which the corresponding field name describing lines
are different. In this table, the field name describing
lines are constituted by a plurality of lines. Thus,
regarding the fields in the lines with the field name,
by scanning for the records coinciding with the
structure of the field up to the last line of the table
to examine correspondence, the records in the table can
be picked up.

In a step S2911, if the field name of the field
name describing line exists, the program goes to a step
S2912. If the field name does not exist, the program
goes to a step S2918. However, if no field name exists
at all, the processing of the multi-row/multi-column
is ended.

In the step S2912, the data of one line with the
field name is extracted, and, in a step S2913, if the
extracted data of one line does not reach the last
line of the lines with the field name, the program
goes to a step S2914. If the last line is reached and
data of one line cannot be extracted, the program goes
to (3).

In the step S2914, if there is a field other than
the field of the line with the field name, the program
goes to a step S2915. If there is no such field, the
program is returned to the step S2911. However, if no
field exists at all, the processing of the
multi-row/multi-column is ended.

In the step S2915, the data of one line is
extracted, and, in a step S2916, if the structure of
the field of one line with the field name coincides
with the structure of the field of one line extracted
in the step S2915, the program goes to a step S2917.
If the structures do not coincide, the program is
returned to the step S2914.

In the step S2917, the structure information of
the field name describing line with which the line
presently scanned coincides is stored, and the program
is returned to the step S2914.

In the step S2918, on the basis of the structure
information stored in the step S2917, the field
information is reformed.

In the above (2), since the field structures of all
of the records in the table coincide, the record can be
picked up by matching the structure of the line with
the field name with the field structure of the record.
Further, the processing is started from the field of
the first record.

In a step S2919 shown in Fig. 29D, if the field
exists, the program goes to a step S2920. If the field
does not exist, the program goes to a step S2923.
However, if no field exists at all, the processing of
the multi-row/multi-column is ended.

In the step S2920, the structure of the field of
one line is examined, and, in a step S2921, if the
data of one line are all the same, since the table is
a composite table, the processing of the
multi-row/multi-column is ended.

Since it is necessary that the field structures of
all of the records coincide, in a step S2922, if
the structure of the fields of the lines examined
up to now coincides with the structure of the field of
one line examined in the step S2920, the program is
returned to the step S2919. If the structures do not
coincide, the program goes to (4).

In the step S2923, on the basis of the field
structure of the record, the field information is
reformed by matching the structure of the line with
the field name with the field structure of the record.

In the above (3), since the table is a table in
which the field portions of the field values are
formed as the multi-row/multi-column, the record can be
picked up by re-reading the field portions having the
multi-row/multi-column. Further, the processing is
started from the field of the first record.

In a step S2924 shown in Fig. 29E, if the field
exists, the program goes to a step S2925. If the field
does not exist, the processing of the
multi-row/multi-column is ended.

In the step S2925, the structure of the field of
one line is examined, and the program goes to a step
S2926.

The fact that the field portion of the field
value is more detailed means that this field includes
the multi-row (or multi-column). Thus, in the step
S2926, if the examination of the structure of the
field of one line shows that the structure is more
detailed than the field name, the program goes to a
step S2927. Otherwise, the processing of the
multi-row/multi-column is ended.

In the step S2927, on the basis of the structure
of the field of one line examined in the step S2925,
the field information is reformed by matching the
structure of the line with the field name with the
field structure of the record.

As mentioned above, regarding the table to be
processed, by analyzing the structure of the table and
regularity of information description constituting the
table and by reforming the table regularly in M
columns and N rows, the table can be divided
correctly.

[Thirteenth Embodiment]

In a thirteenth embodiment of the present
invention, the HTML table reformation part 2501
effects treatment of the composite table. Here, on
the basis of the table data stored in the HTML table
storage part 101, by analyzing regularity of
information description, the table is reformed
regularly without omission of M columns and N rows.

The "composite table" is a table in which a
plurality of tables are included in a single table
and/or a record straddles plural lines, so that the
table analysis cannot be effected easily or simply.

The composite tables can be sorted into (1) a
table in which the line with the field name is
re-described in the table, (2) a table in which the
same field names are included in series, (3) a table
in which a field name (different from the common field
name) and its field value are described partway
through the table, (4) a table in which a combination
of adjacent tables is included in the table, and (5)
others. Here, analyzing methods regarding the above
(1) to (4) will be described.

Now, the details of the reformation of the HTML
table in the step S2600 will be explained with
reference to Figs. 31A to 31D.

Fig. 31A is a flow chart for processing the
composite table in which the line with the field name
is re-described in the table. Here, if the field name
of each line with the field name is included in the
record, such data is removed.

In a step S3101, a field name of one line is
stored, and, in a step S3102, if the field exists, the
program goes to a step S3103. If the field does not
exist, the program goes to (1).

In the step S3103, the field of one line is
stored, and, in a step S3104, the fields of one line
stored in the step S3101 are compared with those
stored in the step S3103, and the program goes to a
step S3105.

In the step S3105, as a result of comparison in
the step S3104, if the fields are the same, the
program goes to a step S3105, and, if not the same, in
a step S3106, the field information is reformed.
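
The intent stated for Fig. 31A, namely that a line with the field name re-described among the records is removed, can be sketched as follows. The exact branching of the flow chart is not reproduced; the function name and the sample data are assumptions for illustration.

```python
def remove_redescribed_field_names(grid):
    """Drop every later line that repeats the line with the field name.

    grid: list of lines; the first line is taken to be the line with
    the field name (an assumption of this sketch).
    """
    if not grid:
        return []
    field_names, *lines = grid
    return [field_names] + [line for line in lines if line != field_names]


composite = [
    ["name", "price"],
    ["A", "1"],
    ["name", "price"],   # field names re-described partway through the table
    ["B", "2"],
]
reformed = remove_redescribed_field_names(composite)
```

After the call, `reformed` keeps a single line with the field name followed by the two records.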

Fig. 31B is a flow chart for processing the
composite table in which the same field names are
included in series. Here, when the field names of the
line with the field name are described a plurality of
times, the arrangement of data is modified.

In a step S3107, if the field exists, the program
goes to a step S3108. If the field does not exist, the
program goes to a step S3112. However, if no field
exists at all, the processing of the composite table
is ended.

In the step S3108, one field name is stored, and
the program goes to a step S3109. This field name is
used for examining whether or not the same field name
is described in the field name describing lines.

In the step S3109, all of the fields of the lines
with the field name are stored, and, in a step S3110,
if the same field name exists in the lines with the
field name, the program goes to a step S3111; whereas,
if the same field name does not exist, the program
goes to (5).

In the step S3111, if the field names form lines
regularly, the program is returned to the step S3107;
whereas, if not, the program goes to (2).

In the step S3112, the reformation of the field
information and reformation of the positional relation
graph are effected. For example, in Fig. 32A, the
field names "ooo", "xxx", "aaa" form lines two times.
Thus, by storing data of the first series (portion
shown by gray color) and then storing data of the
second series (portion shown by white color), the
reformation is effected.
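
The rearrangement for a table whose field names form lines two or more times can be sketched as follows. Detecting the repetition by the smallest repeating period of the field names is an assumption of this sketch, and the sample data are invented.

```python
def unfold_series(grid):
    """Store the data of the first series, then the data of the second series, and so on.

    grid: list of lines; the first line holds the field names, which
    may repeat regularly (e.g. name, price, name, price).
    """
    if not grid:
        return []
    field_names, *records = grid
    n = len(field_names)
    period = n
    for p in range(1, n):
        # p is a valid period if the field names repeat every p columns
        if n % p == 0 and all(field_names[i] == field_names[i % p] for i in range(n)):
            period = p
            break
    if period == n:
        return grid                      # no repetition: nothing to reform
    reformed = [field_names[:period]]
    for start in range(0, n, period):    # one pass per series
        for record in records:
            reformed.append(record[start:start + period])
    return reformed


side_by_side = [
    ["name", "price", "name", "price"],
    ["A", "1", "C", "3"],
    ["B", "2", "D", "4"],
]
unfolded = unfold_series(side_by_side)
```

In this example the records of the first series ("A", "B") are stored before those of the second series ("C", "D").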

Fig. 31C shows a flow for processing a composite
table in which field names different from the common
field names and their field values exist partway
through the table. Here, when the field name
describing lines in which the field names are changed
partially are re-described and data for the new lines
with the field name are described in further fields,
processing for correcting the order of data is
performed.

In a step S3113, a field name of a line is
stored, and, in a step S3114, if the field exists, the
program goes to a step S3115. If the field does not
exist, the program goes to a step S3119. However, if
no field exists at all, the processing of the
composite table is ended.

In the step S3115, a field of a line is stored,
and, in a step S3116, the fields of one line in the
steps S3113 and S3115 are compared, and the program
goes to a step S3117.

In the step S3117, as a result of comparison in
the step S3116, if another field exists, the program
goes to a step S3118; whereas, if no other field
exists, the program is returned to the step S3114.

In a step S3119, reformation of the field
information and reformation of the positional
relationship graph are effected.

For example, in Fig. 32B, there are field names
"ooo", "xxx", "aaa" and "OOO", "®®@". Thus,
the field names are made to "ooo", "xxx", "aaa",
"@®@", and such data are stored and the
reformation is performed.
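
One possible reading of this reformation is that the partially changed field names are merged with the common field names into a single list, and each record is then stored under the merged list. The following sketch assumes that reading; the block representation `(field names, records)`, the function name and the sample data are all hypothetical.

```python
def merge_field_names(blocks):
    """Merge blocks whose field names differ partially into one regular table.

    blocks: list of (field_names, records) pairs in table order.
    Returns (merged field names, records aligned to the merged names);
    a field missing from a record is stored as an empty string.
    """
    merged = []
    for field_names, _ in blocks:
        for name in field_names:
            if name not in merged:
                merged.append(name)
    aligned = []
    for field_names, records in blocks:
        for record in records:
            values = dict(zip(field_names, record))
            aligned.append([values.get(name, "") for name in merged])
    return merged, aligned


blocks = [
    (["name", "price", "color"], [["A", "1", "red"]]),
    (["name", "price", "size"], [["B", "2", "L"]]),   # one field name changed
]
merged_names, merged_records = merge_field_names(blocks)
```

Here the changed field name "size" is appended to the common field names, and each record is padded so that every line has the same number of fields.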

Fig. 31D shows a flow of processing for a
composite table in which there are a plurality of
tables (lists) in the table. Here, when the field
names are common and a plurality of tables are
described in the single table, processing for dividing
the tables or lists individually is performed.

In a step S3120, a field name of a line is
stored, and, in a step S3121, if the field exists, the
program goes to a step S3122. If the field does not
exist, the program goes to a step S3128. However, if
no field exists at all, the processing of the
composite table is ended.

In the step S3122, a field of a line is stored,
and, in a step S3123, all of the fields stored in the
step S3122 up to now are stored, and the program goes
to a step S3124.

In the step S3124, if the same data exist in a
line, since such data is a title, the program goes to
a step S3125 to form a new table. If the same data do
not exist, the program is returned to the step S3121.
However, at the first time, the program does not go to
the step S3125 but is returned to the step S3121.

In steps S3125 and S3126, objects of new field
information and a new positional relationship are
generated, and the program goes to a step S3127, where
reformation of the field information is performed.

For example, in Fig. 32C, regarding the common 
field name, title 1 is described in a second line and 
title 2 is described in a fourth line. First of all, 
if there is the title 1 at the first time, since there 
is no data, a new table is not generated; if there is 
the title 2 at the second time, since the data 
regarding the title 1 has already been stored, a new 
table regarding the title 1 is generated. Lastly, if 
there is no field, since the data regarding the title 
2 has already been stored, a new table regarding the 
title 2 is generated. 
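
The division by titles can be sketched as follows. A line is treated as a title when all of its fields hold the same data, which presupposes at least two columns; lines before the first title are ignored. Both points, as well as the names and sample data, are assumptions of this sketch.

```python
def split_by_titles(grid):
    """Divide a composite table into one table per title line.

    grid: list of lines; the first line holds the common field names.
    Returns a list of (title, table) pairs.
    """
    if not grid:
        return []
    field_names, *lines = grid
    tables = []
    title, records = None, []
    for line in lines:
        if line and len(set(line)) == 1:       # same data in every field: a title
            if title is not None:              # not the first title: emit previous table
                tables.append((title, [field_names] + records))
            title, records = line[0], []
        elif title is not None:
            records.append(line)
    if title is not None:                      # post-treatment for the last title
        tables.append((title, [field_names] + records))
    return tables


composite = [
    ["name", "price"],
    ["title 1", "title 1"],
    ["A", "1"],
    ["title 2", "title 2"],
    ["B", "2"],
]
divided = split_by_titles(composite)
```

As in the example of Fig. 32C, no table is generated when title 1 is first met; the table for title 1 is generated when title 2 is met, and the table for title 2 is generated after the last line.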

In a step S3128 and further steps, since the
processing of the last title is not completed,
post-treatment is performed.

First of all, in the step S3128, if the same data
exist in a line, the program goes to a step S3129 to
form a new table. If the same data do not exist, the
processing of the composite table is ended.

In steps S3129 and S3130, objects of new field
information and a new positional relationship are
generated, and the program goes to a step S3131, where
reformation of the field information is performed, and
then the processing of the composite table is ended.

As mentioned above, regarding the table to be
processed, by analyzing the structure of the table and
regularity of information description constituting the
table and by reforming the table regularly without
omission of M columns and N rows, the table can be
judged.

[Fourteenth Embodiment]

In a fourteenth embodiment of the present
invention, the HTML table reformation part 2501 is
constituted by a supplementary data removal part 3301
and a multi-column/multi-row processing part 3302, as
shown in Fig. 33.

Now, the details of the reformation of the HTML 
table in the step S2600 will be explained with 
reference to Fig. 34. 

In a step S3401, supplementary data is removed
from the HTML table, and, in a step S3402, by
analyzing the structure of the table with reference to
the table data from which the supplementary data is
removed, the table is reformed regularly without
omission of M columns and N rows, and the processing
is ended.

As mentioned above, regarding the table to be
processed, by analyzing the structure of the table and
regularity of information description constituting the
table and by reforming the table regularly without
omission of M columns and N rows, the table can be
judged.

[Fifteenth Embodiment]

In a fifteenth embodiment of the present
invention, the HTML table reformation part 2501 is
constituted by a supplementary data removal part 3501
and a composite table processing part 3502, as shown
in Fig. 35.

Now, the details of the reformation of the HTML 
table in the step S2600 will be explained with 
reference to Fig. 36. 

In a step S3601, supplementary data is removed
from the HTML table, and, in a step S3602, by
analyzing the regularity of the information
description with reference to the table data from
which the supplementary data is removed, the table is
reformed regularly without omission of M columns and N
rows, and the processing is ended.

As mentioned above, regarding the table to be
processed, by analyzing the structure of the table and
regularity of information description constituting the
table and by reforming the table regularly without
omission of M columns and N rows, the table can be
judged.

[Sixteenth Embodiment]

In a sixteenth embodiment of the present
invention, the HTML table reformation part 2501 is
constituted by a supplementary data removal part 3701,
a multi-column/multi-row processing part 3702 and a
composite table processing part 3703, as shown in Fig.
37.

Now, the details of the reformation of the HTML
table in the step S2600 will be explained with
reference to Fig. 38. In a step S3801, supplementary
data is removed from the HTML table, and, in a step
S3802, by analyzing the structure of the table with
reference to the table data from which the
supplementary data is removed, the table is reformed
regularly without omission of M columns and N rows,
and the program goes to a step S3803.

In the step S3803, by analyzing the regularity of
information description with reference to the
reformation data of the step S3802, the table is
reformed regularly without omission of M columns and N
rows, and the processing is ended.
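
The order of the three treatments in the steps S3801 to S3803 can be sketched as a chain in which each stage receives the output of the preceding stage. The three stage functions below are simplified stand-ins invented for this sketch, not the processing parts 3701 to 3703 themselves.

```python
def remove_supplementary_data(rows):
    """Stand-in for step S3801: drop lines narrower than the widest line."""
    if not rows:
        return rows
    width = max(len(row) for row in rows)
    return [row for row in rows if len(row) == width]


def process_multi_row_column(rows):
    """Stand-in for step S3802: the toy input has no spanning fields."""
    return rows


def process_composite_table(rows):
    """Stand-in for step S3803: drop re-described lines with the field name."""
    if not rows:
        return rows
    field_names, *records = rows
    return [field_names] + [r for r in records if r != field_names]


def reform_table(rows):
    """Apply the three treatments in the order S3801, S3802, S3803."""
    for stage in (remove_supplementary_data,
                  process_multi_row_column,
                  process_composite_table):
        rows = stage(rows)
    return rows


raw = [
    ["Price list"],          # supplementary data
    ["name", "price"],
    ["A", "1"],
    ["name", "price"],       # re-described field names
    ["B", "2"],
]
reformed_table = reform_table(raw)
```

The point of the sketch is the ordering only: the composite table treatment operates on the reformation data of the preceding step rather than on the original table.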

As mentioned above, regarding the table to be
processed, by analyzing the structure of the table and
regularity of information description constituting the
table and by reforming the table regularly without
omission of M columns and N rows, the table can be
judged.

[Seventeenth Embodiment]

In a seventeenth embodiment of the present
invention, the HTML table reformation part 2501 is
constituted by a multi-column/multi-row processing
part 3901 and a composite table processing part 3902,
as shown in Fig. 39.

Now, the details of the reformation of the HTML 
table in the step S2600 will be explained with 
reference to Fig. 40. 

In a step S4001, by analyzing the structure of
the table with reference to the table data from which
the supplementary data is removed, the table is
reformed regularly without omission of M columns and N
rows, and the program goes to a step S4002.

In the step S4002, by analyzing the regularity of
information description with reference to the
reformation data of the step S4001, the table is
reformed regularly without omission of M columns and N
rows, and the processing is ended.

As mentioned above, regarding the table to be
processed, by analyzing the structure of the table and
regularity of information description constituting the
table and by reforming the table regularly without
omission of M columns and N rows, the table can be
judged.

Incidentally, the present invention may be
applied to a system including a plurality of pieces of
equipment (for example, a computer body, an interface
equipment, a display and the like) or a system
including a single piece of equipment, so long as the
functions of the above-mentioned embodiments can be
realized.

Further, a technique in which, for the purpose
of operating various devices to realize the functions
of the above-mentioned embodiments, software program
code for realizing the functions of the above-mentioned
embodiments is supplied to a computer (or CPU or MPU)
in an apparatus or a system connected to the various
devices so that the various devices are operated by
the computer in the apparatus or the system in
accordance with the program code is also included
within the scope of the invention. Further, in this
case, the program code itself read out from a
recording medium realizes the functions of the
above-mentioned embodiments, and, thus, the program
code itself and means for supplying the program code
to the computer (for example, a recording medium
storing the program code) constitute the present
invention.

The recording medium for supplying the program
code may be, for example, a floppy disk, a hard disk,
an optical disk, a magneto-optical disk, a CD-ROM, a
CD-R, a magnetic tape, a non-volatile memory card or
a ROM.

Further, not only when the functions of the
above-mentioned embodiments are realized by the
computer carrying out the program code read out, but
also when the functions of the above-mentioned
embodiments are realized by cooperation with an OS
(operating system) operating on the computer or other
application software on the basis of instructions of
the program code, such program code is included within
the scope of the invention.

Further, of course, the present invention
includes a technique in which, after the program code
read out from the recording medium is written in a
memory of a function expansion board inserted into the
computer or a function expansion unit connected to the
computer, a CPU of the function expansion board or the
function expansion unit carries out the actual
processing partially or totally on the basis of
instructions of the program code, thereby realizing the
functions of the above-mentioned embodiments.

When the present invention is applied to the
above-mentioned recording medium, program codes
corresponding to the above-mentioned flow charts may be
stored in the recording medium.

Although the present invention has been described
in its preferred forms with a certain degree of
particularity, many apparently widely different
embodiments of the invention can be made without
departing from the spirit and the scope thereof.

It is to be understood that the invention is not
limited to the specific embodiments thereof except as
defined in the appended claims.

WHAT IS CLAIMED IS:

1. A document segmentation apparatus comprising:
table analyzing means for generating cell
position data indicating a positional relationship
between cells and cell vectors representing
characteristics of the cells, by analyzing a table in
a document to be processed;

table type judging means for judging a table
type with reference to the cell position data and the
cell vectors generated by said table analyzing means;

first segment generating means for generating
a segment from the table when the table type is a
table for showing a table; and

second segment generating means for
generating a segment from the table when the table
type is a table for layout.

2. A document segmentation apparatus according
to claim 1, wherein said first segment generating
means comprise:

cut direction determination means for
determining a cut direction of the table by judging
whether the data is expressed in a column or a row in
the table on the basis of the cell position data and
the cell vectors; and

table segment generating means for generating
a table segment by dividing the table on the basis of
the table type and the cut direction.

3. A document segmentation apparatus according
to claim 2, wherein said second segment generating
means generate the table itself as the segment.

4. A document segmentation apparatus according
to claim 1, wherein said second segment generating
means comprise:

cell cluster generating means for generating
cell cluster information by clustering the cells in
the table; and

layout segment generating means for
generating a segment by connecting the cells in the
table with reference to the cell position data and the
cell cluster information.

5. A document segmentation apparatus according
to claim 4, wherein said first segment generating
means generate the table itself as the segment.

6. A document segmentation apparatus according
to claim 4, wherein said second segment generating
means generate the table itself as the segment.

7. A document segmentation apparatus according
to claim 1, further comprising normal segment
generating means for dividing the document into a
segment which corresponds to one table;
and wherein

the table generated as one segment by said
normal segment generating means is to be processed by
said table analyzing means.

8. A document segmentation apparatus according
to claim 1, wherein said table analyzing means further
generate cell data of the analyzed table and said
table type judging means judge the table type with
reference to the cell data.

9. A document segmentation apparatus according
to claim 8, wherein said table type judging means
comprise similarity judging means for judging the
table type on the basis of similarity between the cell
data positioned at particular positions with reference
to the cell position data and the cell data generated
by said table analyzing means.

10. A document segmentation apparatus according
to claim 8, wherein said table type judging means
comprise partial character line extracting means for
extracting partial character lines from the cell data
positioned at a particular position with reference to
the cell position data and the cell data generated by
said table analyzing means, and character line
comparing means for comparing the extracted partial
character lines to judge the table type.

11. A document segmentation apparatus according
to claim 8, wherein said table type judging means
comprise partial character line extracting means for
extracting partial character lines from the cell data
positioned at a particular position with reference to
the cell position data and the cell data generated by
said table analyzing means, and similarity judging
means for judging the table type on the basis of
similarity between the extracted partial character
lines.

12. A document segmentation apparatus according
to claim 8, wherein said table type judging means
comprise syntax judging means for judging the table
type with reference to the cell position data, the
cell vectors and the cell data generated by said table
analyzing means, and similarity judging means for
judging the table type on the basis of similarity
between the cell data positioned at particular
positions with reference to the cell position data and
the cell data generated by said table analyzing means.



13. A document segmentation apparatus according
to claim 8, wherein said table type judging means
comprise syntax judging means for judging the table
type with reference to the cell position data, the
cell vectors and the cell data generated by said table
analyzing means, partial character line extracting
means for extracting partial character lines from the
cell data positioned at a particular position with
reference to the cell position data and the cell data
generated by said table analyzing means, and character
line comparing means for comparing the extracted
partial character lines to judge the table type.

14. A document segmentation apparatus according
to claim 8, wherein said table type judging means
comprise syntax judging means for judging the table
type with reference to the cell position data, the
cell vectors and the cell data generated by said table
analyzing means, partial character line extracting
means for extracting partial character lines from the
cell data positioned at a particular position with
reference to the cell position data and the cell data
generated by said table analyzing means, and
similarity judging means for judging the table type on
the basis of similarity between the extracted partial
character lines.



15. A document segmentation apparatus according
to claim 1, further comprising table reforming means
for reforming the table so that the number of cells in
each column and each row becomes the same, by
analyzing the table to be processed;
and wherein

said table analyzing means analyze the
reformed table.



16. A document segmentation apparatus according
to claim 15, wherein said table reforming means
comprise supplementary data removing means for
removing data added to the table from the table data.



17. A document segmentation apparatus according
to claim 15, wherein said table reforming means
comprise multi-row/multi-column processing means for
reforming the table regularly by analyzing the
structure of the table data.



18. A document segmentation apparatus according
to claim 15, wherein said table reforming means
comprise composite table processing means for
reforming the table by analyzing regularity of
information description constituting the table.



19. A document segmentation apparatus according
to claim 15, wherein said table reforming means
comprise:

supplementary data removing means for
removing data added to the table from the table data;
and

multi-row/multi-column processing means for
reforming the table regularly by analyzing the
structure of the table data.



20. A document segmentation apparatus according
to claim 15, wherein said table reforming means
comprise:

supplementary data removing means for
removing data added to the table from the table data;
and

composite table processing means for
reforming the table by analyzing regularity of
information description constituting the table.



21. A document segmentation apparatus according
to claim 15, wherein said table reforming means
comprise:

multi-row/multi-column processing means for
reforming the table regularly by analyzing the
structure of the table data; and

composite table processing means for
reforming the table by analyzing regularity of
information description constituting the table.

22. A document segmentation apparatus according
to claim 15, wherein said table reforming means
comprise:

supplementary data removing means for
removing data added to the table from the table data;

multi-row/multi-column processing means for
reforming the table regularly by analyzing the
structure of the table data; and

composite table processing means for
reforming the table by analyzing regularity of
information description constituting the table.

23. A document segmentation method comprising:

a table analyzing step for generating cell
position data indicating a positional relationship
between cells and cell vectors representing
characteristics of the cells, by analyzing a table in
a document to be processed;

a table type judging step for judging a
table type with reference to the cell position data
and the cell vectors generated by said table analyzing
step;

a first segment generating step for
generating a segment from the table when the table
type is a table describing a table; and

a second segment generating step for
generating a segment from the table when the table
type is a table for layout.

24. A document segmentation method according to
claim 23, wherein said first segment generating step
comprises:

a cut direction determination step for
determining a cut direction of the table by judging
whether the data is expressed in a column or a row in
the table on the basis of the cell position data and
the cell vectors; and

a table segment generating step for
generating a table segment by dividing the table on
the basis of the table type and the cut direction.

25. A document segmentation method according to
claim 24, wherein said second segment generating step
generates the table itself as the segment.

26. A document segmentation method according to
claim 23, wherein said second segment generating step
comprises:

a cell cluster generating step for
generating cell cluster information by clustering the
cells in the table; and

a layout segment generating step for
generating a segment by connecting the cells in the
table with reference to the cell position data and the
cell cluster information.



27. A document segmentation method according to
claim 26, wherein said first segment generating step
generates the table itself as the segment.

28. A document segmentation method according to 
claim 26, wherein said second segment generating step 
generates the table itself as the segment. 


29. A document segmentation method according to
claim 23, further comprising a normal segment
generating step for dividing the document into a
segment which corresponds to one table;
and wherein

the table generated as one segment by said
normal segment generating step is to be processed by
said table analyzing step.

30. A document segmentation method according to
claim 23, wherein said table analyzing step further
generates cell data of the analyzed table and said
table type judging step judges the table type with
reference to the cell data.



31. A document segmentation method according to
claim 30, wherein said table type judging step
comprises a similarity judging step for judging the
table type on the basis of similarity between the cell
data positioned at particular positions with reference
to the cell position data and the cell data generated
by said table analyzing step.

32. A document segmentation method according to
claim 30, wherein said table type judging step
comprises a partial character line extracting step for
extracting partial character lines from the cell data
positioned at a particular position with reference to
the cell position data and the cell data generated by
said table analyzing step, and a character line
comparing step for comparing the extracted partial
character lines to judge the table type.

33. A document segmentation method according to
claim 30, wherein said table type judging step
comprises a partial character line extracting step
for extracting partial character lines from the cell
data positioned at a particular position with
reference to the cell position data and the cell data
generated by said table analyzing step, and a
similarity judging step for judging the table type on
the basis of similarity between the extracted partial
character lines.

34. A document segmentation method according to
claim 30, wherein said table type judging step
comprises a syntax judging step for judging the table
type with reference to the cell position data, the
cell vectors and the cell data generated by said table
analyzing step, and a similarity judging step for
judging the table type on the basis of similarity
between the cell data positioned at particular
positions with reference to the cell position data and
the cell data generated by said table analyzing step.

35. A document segmentation method according to
claim 30, wherein said table type judging step
comprises a syntax judging step for judging the table
type with reference to the cell position data, the
cell vectors and the cell data generated by said table
analyzing step, a partial character line extracting
step for extracting partial character lines from the
cell data positioned at a particular position with
reference to the cell position data and the cell data
generated by said table analyzing step, and a
character line comparing step for comparing the
extracted partial character lines to judge the table
type.

36. A document segmentation method according to
claim 30, wherein said table type judging step
comprises a syntax judging step for judging the table
type with reference to the cell position data, the
cell vectors and the cell data generated by said table
analyzing step, a partial character line extracting
step for extracting partial character lines from the
cell data positioned at a particular position with
reference to the cell position data and the cell data
generated by said table analyzing step, and a
similarity judging step for judging the table type on
the basis of similarity between the extracted partial
character lines.

37. A document segmentation method according to
claim 23, further comprising a table reforming step
for reforming the table so that the number of cells in
each column and each row becomes the same, by
analyzing the table to be processed;
and wherein

said table analyzing step analyzes the
reformed table.

38. A document segmentation method according to
claim 37, wherein said table reforming step comprises
a supplementary data removing step for removing data
added to the table from the table data.



39. A document segmentation method according to
claim 37, wherein said table reforming step comprises
a multi-row/multi-column processing step for reforming
the table regularly by analyzing the structure of the
table data.

40. A document segmentation method according to
claim 37, wherein said table reforming step comprises
a composite table processing step for reforming the
table by analyzing regularity of information
description constituting the table.

41. A document segmentation method according to
claim 37, wherein said table reforming step comprises:

a supplementary data removing step for
removing data added to the table from the table data;
and

a multi-row/multi-column processing step for
reforming the table regularly by analyzing the
structure of the table data.

42. A document segmentation method according to
claim 37, wherein said table reforming step comprises:

a supplementary data removing step for
removing data added to the table from the table data;
and

a composite table processing step for
reforming the table by analyzing regularity of
information description constituting the table.

43. A document segmentation method according to
claim 37, wherein said table reforming step comprises:

a multi-row/multi-column processing step for
reforming the table regularly by analyzing the
structure of the table data; and

a composite table processing step for
reforming the table by analyzing regularity of
information description constituting the table.

44. A document segmentation method according to
claim 37, wherein said table reforming step comprises:

a supplementary data removing step for
removing data added to the table from the table data;

a multi-row/multi-column processing step for
reforming the table regularly by analyzing the
structure of the table data; and

a composite table processing step for
reforming the table by analyzing regularity of
information description constituting the table.

45. A computer-readable storage medium storing a
document segmentation program for controlling a
computer to perform document segmentation, said
program comprising codes for causing the computer to
perform:

a table analyzing step for generating cell
position data indicating a positional relationship
between cells and cell vectors representing
characteristics of the cells, by analyzing a table in
a document to be processed;

a table type judging step for judging a
table type with reference to the cell position data
and the cell vectors generated by said table analyzing
step;

a first segment generating step for
generating a segment from the table when the table
type is a table describing a table; and

a second segment generating step for
generating a segment from the table when the table
type is a table for layout.

ABSTRACT OF THE DISCLOSURE

A table in an HTML document is analyzed to
generate cell position data indicating a positional
relationship between cells and cell vectors
representing characteristics of the cells, and a table
type is judged with reference to the cell position
data and the cell vectors. If the table type is a
table describing a table, it is judged whether the
data is represented in a column or a row with
reference to the cell position data and the cell
vectors, a cut direction of the table is
determined, and segments are generated with reference
to the table type and the cut direction. If the table
type is a table for layout, the cells are clustered
with reference to the cell vectors, and the segments
are generated with reference to the cell position data
and cell cluster information.
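
The flow summarized above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the table is assumed to be already parsed from HTML into a rectangular list of rows, the two-feature cell vector, the long-text rule for judging a layout table, and every name used (Cell, analyze_table, judge_table_type, cut_direction, data_segments, layout_segments, segment_table) are hypothetical choices made only for this example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Cell:
    row: int   # cell position data
    col: int
    text: str

    @property
    def vector(self):
        # Hypothetical cell vector: (is numeric, is long text).
        t = self.text.strip()
        return (t.replace(".", "", 1).isdigit(), len(t) > 40)


def analyze_table(rows):
    """Table analysis: attach position data to every cell."""
    return [Cell(r, c, t) for r, row in enumerate(rows) for c, t in enumerate(row)]


def judge_table_type(cells):
    # Assumed rule: any long-text cell marks a table used for layout;
    # otherwise it is treated as a "table describing a table" ("data").
    return "layout" if any(c.vector[1] for c in cells) else "data"


def cut_direction(cells):
    # Ignore the first row and column (assumed headers), then compare how
    # many columns versus rows contain cells with one shared vector.
    body = [c for c in cells if c.row > 0 and c.col > 0]

    def uniform_lines(key):
        groups = {}
        for c in body:
            groups.setdefault(key(c), set()).add(c.vector)
        return sum(1 for vectors in groups.values() if len(vectors) == 1)

    cols = uniform_lines(lambda c: c.col)
    rows = uniform_lines(lambda c: c.row)
    # Uniform columns suggest one record per row, so cut between rows.
    return "row" if cols >= rows else "column"


def data_segments(rows, direction):
    if direction == "column":
        rows = [list(col) for col in zip(*rows)]  # transpose: records become rows
    header, *records = rows
    # One segment per record, each value paired with its header label.
    return [list(zip(header, record)) for record in records]


def layout_segments(cells):
    # Cluster by vector, then connect cells that are consecutive in
    # reading order and fall in the same cluster.
    ordered = sorted(cells, key=lambda c: (c.row, c.col))
    return [[c.text for c in group]
            for _, group in groupby(ordered, key=lambda c: c.vector)]


def segment_table(rows):
    cells = analyze_table(rows)
    if judge_table_type(cells) == "data":
        return data_segments(rows, cut_direction(cells))
    return layout_segments(cells)
```

For example, calling segment_table on a three-row table whose first row is a header yields one segment per record, with each value paired with its header label. A real system would need richer cell vectors and a more robust type judgment than the single threshold assumed here.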